<table class="table table-bordered">
    <tr>
        <th style="text-align:center; width:25%"><img src='https://www.np.edu.sg/PublishingImages/Pages/default/odp/ICT.jpg' style="width: 250px; height: 125px; "></th>
        <th style="text-align:center;"><h1>Deep Learning</h1><h2>Practical 3a - Data Processing Using Pandas</h2><h3>AY2020/21 Semester</h3></th>
    </tr>
</table>

## Objectives
After completing this practical exercise, students should be able to:
1. [Understand the basics of Pandas for data processing tasks](#demo)
2. [Practise the same data processing steps on a different dataset](#exc)

## 1. Pandas <a id='demo' />
This section is a short introduction to the pandas package. For more details, refer to the "10 minutes to pandas" tutorial at: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html.

We will use two CSV files (`nba.csv` and `Players.csv`) in this practical. Download both files from MEL and save them in the same folder as this practical file (.ipynb).

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Load the CSV file into a DataFrame variable df
df = pd.read_csv('nba.csv')
print(df)

                        Name                    Team  Number Position  Age  \
0              Avery Bradley          Boston Celtics       0       PG   25   
1                Jae Crowder          Boston Celtics      99       SF   25   
2               John Holland          Boston Celtics      30       SG   27   
3                R.J. Hunter          Boston Celtics      28       SG   22   
4              Jonas Jerebko          Boston Celtics       8       PF   29   
5               Amir Johnson          Boston Celtics      90       PF   29   
6              Jordan Mickey          Boston Celtics      55       PF   21   
7               Kelly Olynyk          Boston Celtics      41        C   25   
8               Terry Rozier          Boston Celtics      12       PG   22   
9               Marcus Smart          Boston Celtics      36       PG   22   
10           Jared Sullinger          Boston Celtics       7        C   24   
11             Isaiah Thomas          Boston Celtics       4    

In [3]:
# Display the type of data for each column
df.dtypes

Name         object
Team         object
Number        int64
Position     object
Age           int64
Height       object
Weight        int64
College      object
Salary      float64
dtype: object

In [4]:
# Display the first few rows of data
df.head()

| | Name | Team | Number | Position | Age | Height | Weight | College | Salary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Avery Bradley | Boston Celtics | 0 | PG | 25 | 06-02 | 180 | Texas | 7730337.0 |
| 1 | Jae Crowder | Boston Celtics | 99 | SF | 25 | 06-06 | 235 | Marquette | 6796117.0 |
| 2 | John Holland | Boston Celtics | 30 | SG | 27 | 06-05 | 205 | Boston University | NaN |
| 3 | R.J. Hunter | Boston Celtics | 28 | SG | 22 | 06-05 | 185 | Georgia State | 1148640.0 |
| 4 | Jonas Jerebko | Boston Celtics | 8 | PF | 29 | 06-10 | 231 | NaN | 5000000.0 |


In [5]:
# Drop every row that contains at least one missing (NaN) value
df.dropna(inplace=True)
df.head()

| | Name | Team | Number | Position | Age | Height | Weight | College | Salary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Avery Bradley | Boston Celtics | 0 | PG | 25 | 06-02 | 180 | Texas | 7730337.0 |
| 1 | Jae Crowder | Boston Celtics | 99 | SF | 25 | 06-06 | 235 | Marquette | 6796117.0 |
| 3 | R.J. Hunter | Boston Celtics | 28 | SG | 22 | 06-05 | 185 | Georgia State | 1148640.0 |
| 6 | Jordan Mickey | Boston Celtics | 55 | PF | 21 | 06-08 | 235 | LSU | 1170960.0 |
| 7 | Kelly Olynyk | Boston Celtics | 41 | C | 25 | 7-0 | 238 | Gonzaga | 2165160.0 |
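
The cell above removes rows 2 and 4 because each had one missing value. The sketch below illustrates the same rule on a small hypothetical frame (the values are made up, not taken from `nba.csv`) and shows the difference between the default call, which returns a new DataFrame, and `inplace=True`, which modifies the existing one and returns `None`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; the values are invented for illustration.
toy = pd.DataFrame({
    "College": ["Texas", None, "LSU"],
    "Salary": [100.0, 200.0, np.nan],
})

# Default call: returns a new frame and leaves `toy` untouched.
cleaned = toy.dropna()
print(len(toy), len(cleaned))    # 3 1
print(cleaned.index.tolist())    # [0]

# inplace=True: modifies the frame itself and returns None.
in_place = toy.copy()
returned = in_place.dropna(inplace=True)
print(returned)                  # None
print(len(in_place))             # 1
```

Only row 0 survives because it is the only row with no missing value in any column.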


In [6]:
# Display the index
df.index

Int64Index([  0,   1,   3,   6,   7,   8,   9,  10,  11,  12,
            ...
            442, 443, 444, 446, 448, 449, 451, 452, 453, 456],
           dtype='int64', length=364)
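
The index printed above skips labels such as 2, 4 and 5 because `dropna` keeps the original row labels of the surviving rows. The following sketch, on a hypothetical single-column frame, shows how label-based (`.loc`) and position-based (`.iloc`) access differ after rows are dropped, and how `reset_index(drop=True)` renumbers the rows if consecutive labels are preferred.

```python
import pandas as pd

# Hypothetical ages; the None becomes NaN and is dropped below.
toy = pd.DataFrame({"Age": [25, None, 22, 21]})
cleaned = toy.dropna()
print(cleaned.index.tolist())     # [0, 2, 3]

# .loc uses the row label, .iloc uses the row position.
print(cleaned.loc[2, "Age"])      # 22.0
print(cleaned.iloc[1]["Age"])     # 22.0

# Renumber the rows from 0 and discard the old labels.
renumbered = cleaned.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```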

In [7]:
# Display the columns
df.columns

Index(['Name', 'Team', 'Number', 'Position', 'Age', 'Height', 'Weight',
       'College', 'Salary'],
      dtype='object')

In [8]:
# Show a quick statistical summary of the data (numeric columns only)
df.describe()

| | Number | Age | Weight | Salary |
|---|---|---|---|---|
| count | 364.0 | 364.0 | 364.0 | 364.0 |
| mean | 16.82967 | 26.615385 | 219.785714 | 4620311.0 |
| std | 14.994162 | 4.233591 | 24.793099 | 5119716.0 |
| min | 0.0 | 19.0 | 161.0 | 55722.0 |
| 25% | 5.0 | 24.0 | 200.0 | 1000000.0 |
| 50% | 12.0 | 26.0 | 220.0 | 2515440.0 |
| 75% | 25.0 | 29.0 | 240.0 | 6149694.0 |
| max | 99.0 | 40.0 | 279.0 | 22875000.0 |
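
By default `describe()` skips text columns such as `Name` and `Team`. As a minimal sketch on a hypothetical two-column frame, passing `include="all"` adds the text columns to the summary, with `unique`, `top` and `freq` rows for them.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
toy = pd.DataFrame({
    "Team": ["X", "X", "Y"],
    "Age": [20, 30, 40],
})

# Default: only the numeric column is summarised.
numeric_summary = toy.describe()
print(numeric_summary.columns.tolist())   # ['Age']
print(numeric_summary.loc["mean", "Age"]) # 30.0

# include="all": text columns are summarised as well.
all_summary = toy.describe(include="all")
print(all_summary.loc["unique", "Team"])  # 2
print(all_summary.loc["top", "Team"])     # X
```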


In [9]:
# Select multiple columns (the numeric columns)
df2 = df.loc[:, ['Number', 'Age', 'Weight', 'Salary']]
print(df2.head())

   Number  Age  Weight     Salary
0       0   25     180  7730337.0
1      99   25     235  6796117.0
3      28   22     185  1148640.0
6      55   21     235  1170960.0
7      41   25     238  2165160.0
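
The cell above lists the numeric columns by name. As an alternative sketch (not the method used in this practical), `select_dtypes(include="number")` picks them by data type, which avoids typing the names out. The frame below is a small stand-in for `df`, not the real data.

```python
import pandas as pd

# Small stand-in frame with one text column and two numeric columns.
toy = pd.DataFrame({
    "Name": ["A", "B"],
    "Number": [0, 99],
    "Salary": [7730337.0, 6796117.0],
})

# Keep only the columns whose dtype is numeric (integer or float).
numeric = toy.select_dtypes(include="number")
print(numeric.columns.tolist())  # ['Number', 'Salary']
```

Listing the names explicitly, as in the practical, is still useful when a numeric column should be left out on purpose.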


In [10]:
# Convert the DataFrame to a NumPy array (both options return the same data)
array = df2.values      # option 1
array = df2.to_numpy()  # option 2: new in pandas 0.24.0

print(array)
print(array.shape)

[[0.000000e+00 2.500000e+01 1.800000e+02 7.730337e+06]
 [9.900000e+01 2.500000e+01 2.350000e+02 6.796117e+06]
 [2.800000e+01 2.200000e+01 1.850000e+02 1.148640e+06]
 ...
 [4.100000e+01 2.000000e+01 2.340000e+02 2.239800e+06]
 [8.000000e+00 2.600000e+01 2.030000e+02 2.433333e+06]
 [2.400000e+01 2.600000e+01 2.310000e+02 9.472760e+05]]
(364, 4)
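
Every value in the printed array is a float, even though `Number`, `Age` and `Weight` are integer columns. A NumPy array holds a single data type, so pandas converts mixed integer and float columns to a common type. The sketch below shows this on a hypothetical two-row frame.

```python
import numpy as np
import pandas as pd

# One integer column and one float column (values chosen for illustration).
toy = pd.DataFrame({"Age": [25, 22], "Salary": [7730337.0, 1148640.0]})

# Mixed integer and float columns are upcast to float64.
mixed = toy.to_numpy()
print(mixed.dtype)       # float64

# A frame with only integer columns keeps an integer dtype.
ints_only = toy[["Age"]].to_numpy()
print(ints_only.dtype)
```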


In [11]:
# Convert the NumPy array to a DataFrame (columns get default integer labels)
df3 = pd.DataFrame(array)
df3.head()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.0 | 25.0 | 180.0 | 7730337.0 |
| 1 | 99.0 | 25.0 | 235.0 | 6796117.0 |
| 2 | 28.0 | 22.0 | 185.0 | 1148640.0 |
| 3 | 55.0 | 21.0 | 235.0 | 1170960.0 |
| 4 | 41.0 | 25.0 | 238.0 | 2165160.0 |


In [12]:
# Convert the NumPy array to a DataFrame with explicit column names
df3 = pd.DataFrame(array, columns=['Number', 'Age', 'Weight', 'Salary'])
df3.head()

| | Number | Age | Weight | Salary |
|---|---|---|---|---|
| 0 | 0.0 | 25.0 | 180.0 | 7730337.0 |
| 1 | 99.0 | 25.0 | 235.0 | 6796117.0 |
| 2 | 28.0 | 22.0 | 185.0 | 1148640.0 |
| 3 | 55.0 | 21.0 | 235.0 | 1170960.0 |
| 4 | 41.0 | 25.0 | 238.0 | 2165160.0 |
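
After the round trip through NumPy, `Number`, `Age` and `Weight` come back as floats (`0.0`, `25.0`, ...). If integer columns are wanted again, one option is `astype` with a per-column mapping. The sketch below uses the first two rows shown above and assumes the columns contain whole numbers only.

```python
import numpy as np
import pandas as pd

# First two rows of the array from this practical.
array = np.array([[0.0, 25.0, 180.0, 7730337.0],
                  [99.0, 25.0, 235.0, 6796117.0]])

restored = pd.DataFrame(array, columns=["Number", "Age", "Weight", "Salary"])

# Convert selected columns back to integers; Salary stays float64.
restored = restored.astype({"Number": "int64", "Age": "int64", "Weight": "int64"})
print(restored.dtypes)
```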


In [13]:
# Export the DataFrame to a CSV file (the row index is written as an unnamed first column)
df3.to_csv('nba_new.csv')
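
By default `to_csv` also writes the row index, as a first column with an empty header. When such a file is read back, pandas names that column `Unnamed: 0`, which is likely why `Players.csv` shows an `Unnamed: 0` column in the exercise. The sketch below demonstrates this in memory on a hypothetical frame (omitting the file path makes `to_csv` return the CSV text) and shows `index=False` as one way to avoid the extra column.

```python
import io
import pandas as pd

# Hypothetical frame with non-consecutive row labels.
toy = pd.DataFrame({"Age": [25, 22]}, index=[0, 3])

# No path given, so the CSV text is returned as a string.
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

# Reading back the default export adds an 'Unnamed: 0' column.
reloaded = pd.read_csv(io.StringIO(with_index))
print(reloaded.columns.tolist())        # ['Unnamed: 0', 'Age']

# With index=False only the data columns are written.
reloaded_clean = pd.read_csv(io.StringIO(without_index))
print(reloaded_clean.columns.tolist())  # ['Age']
```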

## 2. Exercise <a id='exc' />
Load the data from `Players.csv` and complete the tasks below using what you have learned in this practical.

In [14]:
# Task 1: Load the csv file 'Players.csv' to a DataFrame variable df
df = pd.read_csv('Players.csv')
print(df.dtypes)
print(df)

Unnamed: 0       int64
Player          object
height         float64
weight         float64
collage         object
born           float64
birth_city      object
birth_state     object
dtype: object
      Unnamed: 0                 Player  height  weight  \
0              0        Curly Armstrong   180.0    77.0   
1              1           Cliff Barker   188.0    83.0   
2              2          Leo Barnhorst   193.0    86.0   
3              3             Ed Bartels   196.0    88.0   
4              4            Ralph Beard   178.0    79.0   
5              5             Gene Berce   180.0    79.0   
6              6          Charlie Black   196.0    90.0   
7              7            Nelson Bobb   183.0    77.0   
8              8        Jake Bornheimer   196.0    90.0   
9              9           Vince Boryla   196.0    95.0   
10            10              Don Boven   193.0    95.0   
11            11          Harry Boykoff   208.0   102.0   
12            12            Joe Bra

In [15]:
# Task 2: Clean up the data (if required), select the numeric columns and assign them to a new DataFrame df2
# ('Unnamed: 0' appears to be a saved row index, so it is left out)
df.dropna(inplace=True)
df2 = df.loc[:, ['height', 'weight', 'born']]
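
One point to be aware of in Task 2: calling `dropna` on the whole frame first also removes players whose only missing value is in a text column such as `collage`. The sketch below, on a hypothetical three-row frame, compares that order with selecting the numeric columns first. Which order is appropriate depends on whether the text columns matter for the later analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical players; 'collage' is spelled as in Players.csv.
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "height": [180.0, 188.0, np.nan],
    "collage": ["Indiana", None, "Kentucky"],
})

# Drop first, then select: player B is lost because 'collage' is missing.
drop_then_select = toy.dropna().loc[:, ["height"]]
print(len(drop_then_select))  # 1

# Select first, then drop: only player C (missing height) is removed.
select_then_drop = toy.loc[:, ["height"]].dropna()
print(len(select_then_drop))  # 2
```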

In [16]:
# Task 3: Convert df2 to a Numpy Array
array = df2.values


In [17]:
# Task 4: Convert the NumPy array to a DataFrame df3 with explicit column names
df3 = pd.DataFrame(array, columns=['height', 'weight', 'born'])

In [18]:
# Task 5: Export df3 to a new csv file (Players_new.csv)
df3.to_csv('Players_new.csv')

