#### Data Wrangling 1

#### Perform the following operations using Python on any open-source dataset (e.g., data.csv)

#### 1.1 Import all the required Python Libraries.
#### 1.2 Locate an open-source dataset on the web (e.g. https://www.kaggle.com). Provide a clear description of the data and its source (i.e., URL of the web site).
#### 1.3 Load the dataset into a pandas DataFrame.
#### 1.4 Data Preprocessing: check for missing values in the data using the pandas isnull() function, and use describe() to get some initial statistics. Provide variable descriptions and variable types. Check the dimensions of the data frame.
#### 1.5 Data Formatting and Data Normalization: Summarize the types of variables by checking the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the correct data type, apply proper type conversions.
#### 1.6 Turn categorical variables into quantitative variables in Python.
#### In addition to the code and outputs, explain every operation performed in the above steps, including everything done to import/read/scrape the data set.

#### 1.1 Import all the required Python Libraries

In [37]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#### 1.2 Locate an open-source dataset on the web (e.g. https://www.kaggle.com). Provide a clear description of the data and its source (i.e., URL of the web site).
#### 1.3 Load the dataset into a pandas DataFrame.

In [38]:
# Read the CSV file from the working directory into a DataFrame
df = pd.read_csv("nba.csv")
df

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
2,John Holland,Boston Celtics,30.0,SG,27.0,6-5,205.0,Boston University,
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6-5,185.0,Georgia State,1148640.0
4,Jonas Jerebko,Boston Celtics,8.0,PF,29.0,6-10,231.0,,5000000.0
...,...,...,...,...,...,...,...,...,...
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
454,Raul Neto,Utah Jazz,25.0,PG,24.0,6-1,179.0,,900000.0
455,Tibor Pleiss,Utah Jazz,21.0,C,26.0,7-3,256.0,,2900000.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7-0,231.0,Kansas,947276.0


#### This dataset contains information about NBA players, one row per player.
#### Source : https://media.geeksforgeeks.org/wp-content/uploads/nba.csv
#### Positions : PG = Point Guard, SG = Shooting Guard, SF = Small Forward, PF = Power Forward, C = Center

#### 1.4 Data Preprocessing: check for missing values in the data using the pandas isnull() function, and use describe() to get some initial statistics. Provide variable descriptions and variable types. Check the dimensions of the data frame.

#### 1.4.1 Checking for missing values in the data using the pandas isnull() function (True marks a missing value):

In [39]:
df.isnull()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...
453,False,False,False,False,False,False,False,False,False
454,False,False,False,False,False,False,False,True,False
455,False,False,False,False,False,False,False,True,False
456,False,False,False,False,False,False,False,False,False
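
The Boolean table above is hard to read for a few hundred rows. A minimal sketch of how the same check can be summarized per column and per row follows; the three-row `sample` frame is a hypothetical stand-in for the NBA data, not the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample shaped like the NBA data (not the real file).
sample = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "College": ["Texas", np.nan, np.nan],
    "Salary": [7730337.0, np.nan, 900000.0],
})

# isnull() returns a Boolean frame; sum() counts the True values per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# any(axis=1) flags rows that contain at least one missing value.
rows_with_missing = sample[sample.isnull().any(axis=1)]
print(len(rows_with_missing))
```

On the real frame, the same two expressions would be applied to `df` instead of `sample`.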


#### 1.4.2 : describe() function :

In [40]:
df.describe()

Unnamed: 0,Number,Age,Weight,Salary
count,457.0,457.0,457.0,446.0
mean,17.678337,26.938731,221.522976,4842684.0
std,15.96609,4.404016,26.368343,5229238.0
min,0.0,19.0,161.0,30888.0
25%,5.0,24.0,200.0,1044792.0
50%,13.0,26.0,220.0,2839073.0
75%,25.0,30.0,240.0,6500000.0
max,99.0,40.0,307.0,25000000.0


#### 1.4.3 Variable description :

#### Name : Name of the NBA player
#### Team : Team the player belongs to
#### Number : Jersey number of the player
#### Position : PG = Point Guard, SG = Shooting Guard, SF = Small Forward, PF = Power Forward, C = Center
#### Age : Age of the player in years
#### Height : Height of the player in feet-inches format (e.g., 6-2)
#### Weight : Weight of the player in pounds
#### College : College the player attended
#### Salary : Salary of the player

#### 1.4.4 Types of Variable :

In [41]:
df.dtypes

Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

#### 1.4.5 Dimensions of Data Frame :

In [42]:
df.shape

(458, 9)

In [43]:
df.index

RangeIndex(start=0, stop=458, step=1)

#### 1.5 Data Formatting and Data Normalization: Summarize the types of variables by checking the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the correct data type, apply proper type conversions.


In [44]:
df.dtypes

Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object

#### Height is stored as text (object) and should be converted to float; Name should be converted to string.
#### Team, Position and College could also be converted to string, but they are kept as object so that they can be selected for encoding in section 1.6.

In [45]:
# Drop every row that has at least one missing value, modifying df in place
df.dropna(inplace = True)

In [46]:
df.isnull().sum()

Name        0
Team        0
Number      0
Position    0
Age         0
Height      0
Weight      0
College     0
Salary      0
dtype: int64

In [47]:
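# Convert Name from object to the pandas string dtype
df["Name"] = df["Name"].astype("string")
# Team, Position and College are left as object for the encoding step in 1.6
# df["Team"] = df["Team"].astype("string")
# df["Position"] = df["Position"].astype("string")
# df["College"] = df["College"].astype("string")
# Turn "6-2" into "6.2" and cast to float.
# Caveat: this is not a true decimal height; "6-10" becomes 6.1, the same value as "6-1".
df["Height"] = df["Height"].str.replace('-','.')
df["Height"] = df["Height"].astype(float)
df.dtypes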

Name        string[python]
Team                object
Number             float64
Position            object
Age                float64
Height             float64
Weight             float64
College             object
Salary             float64
dtype: object
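
The conversion above replaces the hyphen with a decimal point, so a height of 6-10 ends up as 6.1, indistinguishable from 6-1. A minimal sketch of one alternative is to split the text into feet and inches and compute the total in inches. The `heights` series and the name `height_in` are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical sample of heights in the original "feet-inches" text format.
heights = pd.Series(["6-2", "6-10", "7-0", "5-11"])

# The naive approach: "6-10" -> "6.10" -> 6.1, which collides with "6-1".
naive = heights.str.replace("-", ".", regex=False).astype(float)

# Split on the hyphen into two columns (feet, inches) and convert to numbers.
parts = heights.str.split("-", expand=True).astype(float)

# Total height in inches: 12 inches per foot plus the remaining inches.
height_in = parts[0] * 12 + parts[1]
print(height_in.tolist())
```

With heights in inches, ordering and min-max scaling behave as expected, because 6-10 (82 inches) is now larger than 6-2 (74 inches).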

#### Data Normalization (Min Max Scaling):

In [48]:
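# Select the numeric columns; only these are rescaled
num_col = df.select_dtypes(include = ['int64','float64']).columns
df_manual = df.copy()

# Min-max scaling: (x - min) / (max - min) maps each column onto the range [0, 1]
for val in num_col:
    min_val = df[val].min()
    max_val = df[val].max()
    df_manual[val] = (df[val] - min_val) / (max_val - min_val)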

In [49]:
df_manual

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.000000,PG,0.285714,0.521531,0.161017,Texas,0.336322
1,Jae Crowder,Boston Celtics,1.000000,SF,0.285714,0.712919,0.627119,Marquette,0.295382
3,R.J. Hunter,Boston Celtics,0.282828,SG,0.142857,0.665072,0.203390,Georgia State,0.047895
6,Jordan Mickey,Boston Celtics,0.555556,PF,0.095238,0.808612,0.627119,LSU,0.048873
7,Kelly Olynyk,Boston Celtics,0.414141,C,0.285714,0.904306,0.652542,Gonzaga,0.092441
...,...,...,...,...,...,...,...,...,...
449,Rodney Hood,Utah Jazz,0.050505,SG,0.190476,0.808612,0.381356,Duke,0.056650
451,Chris Johnson,Utah Jazz,0.232323,SF,0.333333,0.712919,0.381356,Dayton,0.040563
452,Trey Lyles,Utah Jazz,0.414141,PF,0.047619,0.473684,0.618644,Kentucky,0.095712
453,Shelvin Mack,Utah Jazz,0.080808,PG,0.333333,0.569378,0.355932,Butler,0.104193
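
The scaling loop divides by `max - min`, which is zero for a column whose values are all equal. A minimal sketch of the same min-max scaling wrapped in a function with a guard for that case follows. The function name `min_max_scale` and the sample values are hypothetical, and mapping a constant column to 0.0 is one possible convention, not the only one.

```python
import pandas as pd

def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every numeric column rescaled to the range [0, 1]."""
    scaled = frame.copy()
    for col in frame.select_dtypes(include="number").columns:
        col_min = frame[col].min()
        col_range = frame[col].max() - col_min
        # A constant column has zero range; map it to 0.0 instead of dividing by zero.
        scaled[col] = 0.0 if col_range == 0 else (frame[col] - col_min) / col_range
    return scaled

# Hypothetical sample values, not taken from the dataset.
sample = pd.DataFrame({
    "Age": [20.0, 25.0, 40.0],
    "Games": [82, 82, 82],
    "Team": ["A", "B", "C"],
})
scaled = min_max_scale(sample)
print(scaled)
```

Non-numeric columns such as `Team` pass through unchanged, as they do in `df_manual`.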


#### 1.6 Turn categorical variables into quantitative variables in Python (using label encoding and one-hot encoding)

In [50]:
df.dtypes

Name        string[python]
Team                object
Number             float64
Position            object
Age                float64
Height             float64
Weight             float64
College             object
Salary             float64
dtype: object

#### Label Encoding

In [53]:
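from sklearn.preprocessing import LabelEncoder
# Name now has dtype string, so it is not selected here and stays as text
cat_cols = df.select_dtypes(include = ['category','object']).columns
le = LabelEncoder()
df_le = df.copy()

# Replace each category with an integer code; labels are sorted, so for Position C = 0, PF = 1, ...
# The encoder is refitted for every column, so afterwards it only remembers the last one
for val in cat_cols:
    df_le[val] = le.fit_transform(df[val].astype(str))

df_le.head()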

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,1,0.0,2,25.0,6.2,180.0,90,7730337.0
1,Jae Crowder,1,99.0,3,25.0,6.6,235.0,50,6796117.0
3,R.J. Hunter,1,28.0,4,22.0,6.5,185.0,31,1148640.0
6,Jordan Mickey,1,55.0,1,21.0,6.8,235.0,44,1170960.0
7,Kelly Olynyk,1,41.0,0,25.0,7.0,238.0,33,2165160.0
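
Because one `LabelEncoder` is reused across columns, the code-to-label mapping of earlier columns is not kept. A minimal sketch of an alternative that keeps the mapping uses pandas categorical codes instead of scikit-learn; this is a swapped-in technique, not the method used in the cell above, and the `positions` sample is hypothetical.

```python
import pandas as pd

# Hypothetical sample of positions.
positions = pd.Series(["PG", "SF", "SG", "PF", "C", "PG"])

# Convert to a categorical; categories are sorted alphabetically by default.
as_category = positions.astype("category")
codes = as_category.cat.codes

# Keep the code-to-label mapping so the encoding can be reversed later.
mapping = dict(enumerate(as_category.cat.categories))
print(codes.tolist())
print(mapping)
```

The codes agree with the Position column of `df_le` above (PG = 2, SF = 3, SG = 4, PF = 1, C = 0) because both approaches sort the labels before numbering them.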


In [55]:
df.dtypes


Name        string[python]
Team                object
Number             float64
Position            object
Age                float64
Height             float64
Weight             float64
College             object
Salary             float64
dtype: object

#### One-hot encoding:

In [57]:
# Create one True/False indicator column per distinct value of each categorical column
df_ohe = pd.get_dummies(df,columns = cat_cols)
df_ohe.head()

Unnamed: 0,Name,Number,Age,Height,Weight,Salary,Team_Atlanta Hawks,Team_Boston Celtics,Team_Brooklyn Nets,Team_Charlotte Hornets,...,College_Washington State,College_Weber State,College_Westchester CC,College_Western Carolina,College_Western Kentucky,College_Western Michigan,College_Wichita State,College_Wisconsin,College_Wyoming,College_Xavier
0,Avery Bradley,0.0,25.0,6.2,180.0,7730337.0,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
1,Jae Crowder,99.0,25.0,6.6,235.0,6796117.0,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
3,R.J. Hunter,28.0,22.0,6.5,185.0,1148640.0,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
6,Jordan Mickey,55.0,21.0,6.8,235.0,1170960.0,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
7,Kelly Olynyk,41.0,25.0,7.0,238.0,2165160.0,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
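
One-hot encoding creates one column per distinct value, which is why College alone contributes so many columns above. A minimal sketch on a hypothetical three-row frame makes the mechanics easier to see; `dtype=int` is used here so the indicators appear as 0/1 rather than True/False.

```python
import pandas as pd

# Hypothetical sample, not taken from the dataset.
sample = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Position": ["PG", "C", "PG"],
})

# One indicator column per distinct Position value; the original column is removed.
encoded = pd.get_dummies(sample, columns=["Position"], dtype=int)
print(encoded.columns.tolist())
print(encoded["Position_PG"].tolist())
```

Each row has exactly one 1 among the `Position_*` columns, marking that player's position.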


#### Extra Functions:

In [58]:
df.head(11)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6.2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6.6,235.0,Marquette,6796117.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6.5,185.0,Georgia State,1148640.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6.8,235.0,LSU,1170960.0
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7.0,238.0,Gonzaga,2165160.0
8,Terry Rozier,Boston Celtics,12.0,PG,22.0,6.2,190.0,Louisville,1824360.0
9,Marcus Smart,Boston Celtics,36.0,PG,22.0,6.4,220.0,Oklahoma State,3431040.0
10,Jared Sullinger,Boston Celtics,7.0,C,24.0,6.9,260.0,Ohio State,2569260.0
11,Isaiah Thomas,Boston Celtics,4.0,PG,27.0,5.9,185.0,Washington,6912869.0
12,Evan Turner,Boston Celtics,11.0,SG,27.0,6.7,220.0,Ohio State,3425510.0


In [59]:
df.tail()

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
449,Rodney Hood,Utah Jazz,5.0,SG,23.0,6.8,206.0,Duke,1348440.0
451,Chris Johnson,Utah Jazz,23.0,SF,26.0,6.6,206.0,Dayton,981348.0
452,Trey Lyles,Utah Jazz,41.0,PF,20.0,6.1,234.0,Kentucky,2239800.0
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6.3,203.0,Butler,2433333.0
456,Jeff Withey,Utah Jazz,24.0,C,26.0,7.0,231.0,Kansas,947276.0


In [60]:
df_sorted = df.sort_values(by = 'Name')
df_sorted

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
152,Aaron Brooks,Chicago Bulls,0.0,PG,31.0,6.0,161.0,Oregon,2250000.0
356,Aaron Gordon,Orlando Magic,0.0,PF,20.0,6.9,220.0,Arizona,4171680.0
328,Aaron Harrison,Charlotte Hornets,9.0,SG,21.0,6.6,210.0,Kentucky,525093.0
404,Adreian Payne,Minnesota Timberwolves,33.0,PF,25.0,6.1,237.0,Michigan State,1938840.0
312,Al Horford,Atlanta Hawks,15.0,C,30.0,6.1,245.0,Florida,12000000.0
...,...,...,...,...,...,...,...,...,...
141,Willie Cauley-Stein,Sacramento Kings,0.0,C,22.0,7.0,240.0,Kentucky,3398280.0
25,Willie Reed,Brooklyn Nets,33.0,PF,26.0,6.1,220.0,Saint Louis,947276.0
386,Wilson Chandler,Denver Nuggets,21.0,SF,29.0,6.8,225.0,DePaul,10449438.0
402,Zach LaVine,Minnesota Timberwolves,8.0,PG,21.0,6.5,189.0,UCLA,2148360.0


In [62]:
df['Name']

0      Avery Bradley
1        Jae Crowder
3        R.J. Hunter
6      Jordan Mickey
7       Kelly Olynyk
           ...      
449      Rodney Hood
451    Chris Johnson
452       Trey Lyles
453     Shelvin Mack
456      Jeff Withey
Name: Name, Length: 364, dtype: string

In [64]:
df[['Name','Position']]

Unnamed: 0,Name,Position
0,Avery Bradley,PG
1,Jae Crowder,SF
3,R.J. Hunter,SG
6,Jordan Mickey,PF
7,Kelly Olynyk,C
...,...,...
449,Rodney Hood,SG
451,Chris Johnson,SF
452,Trey Lyles,PF
453,Shelvin Mack,PG


In [65]:
df[0:3]

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6.2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6.6,235.0,Marquette,6796117.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6.5,185.0,Georgia State,1148640.0


In [66]:
df.sample(4)

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
145,Duje Dukan,Sacramento Kings,26.0,PF,24.0,6.9,220.0,Wisconsin,525093.0
303,Kevin Martin,San Antonio Spurs,23.0,SG,33.0,6.7,199.0,Western Carolina,200600.0
163,Bobby Portis,Chicago Bulls,5.0,PF,21.0,6.11,230.0,Arkansas,1391160.0
179,Tristan Thompson,Cleveland Cavaliers,13.0,C,25.0,6.9,238.0,Texas,14260870.0


In [67]:
df.T

Unnamed: 0,0,1,3,6,7,8,9,10,11,12,...,442,443,444,446,448,449,451,452,453,456
Name,Avery Bradley,Jae Crowder,R.J. Hunter,Jordan Mickey,Kelly Olynyk,Terry Rozier,Marcus Smart,Jared Sullinger,Isaiah Thomas,Evan Turner,...,Trevor Booker,Trey Burke,Alec Burks,Derrick Favors,Gordon Hayward,Rodney Hood,Chris Johnson,Trey Lyles,Shelvin Mack,Jeff Withey
Team,Boston Celtics,Boston Celtics,Boston Celtics,Boston Celtics,Boston Celtics,Boston Celtics,Boston Celtics,Boston Celtics,Boston Celtics,Boston Celtics,...,Utah Jazz,Utah Jazz,Utah Jazz,Utah Jazz,Utah Jazz,Utah Jazz,Utah Jazz,Utah Jazz,Utah Jazz,Utah Jazz
Number,0.0,99.0,28.0,55.0,41.0,12.0,36.0,7.0,4.0,11.0,...,33.0,3.0,10.0,15.0,20.0,5.0,23.0,41.0,8.0,24.0
Position,PG,SF,SG,PF,C,PG,PG,C,PG,SG,...,PF,PG,SG,PF,SF,SG,SF,PF,PG,C
Age,25.0,25.0,22.0,21.0,25.0,22.0,22.0,24.0,27.0,27.0,...,28.0,23.0,24.0,24.0,26.0,23.0,26.0,20.0,26.0,26.0
Height,6.2,6.6,6.5,6.8,7.0,6.2,6.4,6.9,5.9,6.7,...,6.8,6.1,6.6,6.1,6.8,6.8,6.6,6.1,6.3,7.0
Weight,180.0,235.0,185.0,235.0,238.0,190.0,220.0,260.0,185.0,220.0,...,228.0,191.0,214.0,265.0,226.0,206.0,206.0,234.0,203.0,231.0
College,Texas,Marquette,Georgia State,LSU,Gonzaga,Louisville,Oklahoma State,Ohio State,Washington,Ohio State,...,Clemson,Michigan,Colorado,Georgia Tech,Butler,Duke,Dayton,Kentucky,Butler,Kansas
Salary,7730337.0,6796117.0,1148640.0,1170960.0,2165160.0,1824360.0,3431040.0,2569260.0,6912869.0,3425510.0,...,4775000.0,2658240.0,9463484.0,12000000.0,15409570.0,1348440.0,981348.0,2239800.0,2433333.0,947276.0


In [68]:
df.iloc[12:14,3:7]

Unnamed: 0,Position,Age,Height,Weight
16,SG,24.0,6.3,190.0
17,SG,28.0,6.4,200.0
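
Note that `iloc[12:14]` returned the rows labelled 16 and 17: `iloc` counts positions, and the index labels have gaps since `dropna()` removed rows. A minimal sketch of the difference between position-based and label-based selection, on a hypothetical frame with a gapped index:

```python
import pandas as pd

# Hypothetical frame whose index labels have gaps, as happens after dropna().
sample = pd.DataFrame({"Age": [25.0, 22.0, 21.0, 27.0]}, index=[0, 3, 6, 7])

# iloc selects by position: the second and third rows, whatever their labels.
by_position = sample.iloc[1:3]

# loc selects by label, and the end label is included.
by_label = sample.loc[3:6]

# reset_index(drop=True) renumbers the rows 0..n-1 so labels and positions agree again.
renumbered = sample.reset_index(drop=True)
print(by_position.index.tolist(), by_label.index.tolist(), renumbered.index.tolist())
```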


In [70]:
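# copy() gives an independent frame; a plain assignment would only create a second name for df
d = df.copy()
# fillna() returns a new frame with missing values replaced by -1; d itself is not modified.
# No missing values remain after dropna(), so the output here equals df
d.fillna(value = -1)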

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6.2,180.0,Texas,7730337.0
1,Jae Crowder,Boston Celtics,99.0,SF,25.0,6.6,235.0,Marquette,6796117.0
3,R.J. Hunter,Boston Celtics,28.0,SG,22.0,6.5,185.0,Georgia State,1148640.0
6,Jordan Mickey,Boston Celtics,55.0,PF,21.0,6.8,235.0,LSU,1170960.0
7,Kelly Olynyk,Boston Celtics,41.0,C,25.0,7.0,238.0,Gonzaga,2165160.0
...,...,...,...,...,...,...,...,...,...
449,Rodney Hood,Utah Jazz,5.0,SG,23.0,6.8,206.0,Duke,1348440.0
451,Chris Johnson,Utah Jazz,23.0,SF,26.0,6.6,206.0,Dayton,981348.0
452,Trey Lyles,Utah Jazz,41.0,PF,20.0,6.1,234.0,Kentucky,2239800.0
453,Shelvin Mack,Utah Jazz,8.0,PG,26.0,6.3,203.0,Butler,2433333.0
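
The `fillna()` call above runs after `dropna()`, so there is nothing left to fill and the output looks identical to `df`. A minimal sketch on a hypothetical frame that still contains a missing value shows the effect, and also that the result has to be assigned to be kept:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing salary.
original = pd.DataFrame({"Salary": [7730337.0, np.nan, 900000.0]})

# copy() creates an independent frame, so the original is left untouched.
working = original.copy()

# fillna returns a new frame, so the result is assigned back to keep it.
working = working.fillna(value=-1)
print(working["Salary"].tolist())
```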


In [71]:
filterdata = df[df['College'] == 'Texas']
filterdata

Unnamed: 0,Name,Team,Number,Position,Age,Height,Weight,College,Salary
0,Avery Bradley,Boston Celtics,0.0,PG,25.0,6.2,180.0,Texas,7730337.0
66,Cory Joseph,Toronto Raptors,6.0,PG,24.0,6.3,190.0,Texas,7000000.0
133,P.J. Tucker,Phoenix Suns,17.0,SF,31.0,6.6,245.0,Texas,5500000.0
179,Tristan Thompson,Cleveland Cavaliers,13.0,C,25.0,6.9,238.0,Texas,14260870.0
208,Myles Turner,Indiana Pacers,33.0,PF,20.0,6.11,243.0,Texas,2357760.0
289,Jordan Hamilton,New Orleans Pelicans,25.0,SG,25.0,6.7,220.0,Texas,1015421.0
294,LaMarcus Aldridge,San Antonio Spurs,12.0,PF,30.0,6.11,240.0,Texas,19689000.0
384,D.J. Augustin,Denver Nuggets,12.0,PG,28.0,6.0,183.0,Texas,3000000.0
414,Kevin Durant,Oklahoma City Thunder,35.0,SF,27.0,6.9,240.0,Texas,20158622.0
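
The filter above uses a single condition. A minimal sketch of how conditions can be combined follows; the sample frame and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample, not taken from the dataset.
sample = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "College": ["Texas", "Duke", "Texas", "Kansas"],
    "Age": [25.0, 23.0, 31.0, 26.0],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
texas_under_30 = sample[(sample["College"] == "Texas") & (sample["Age"] < 30)]

# isin() matches any value from a list.
texas_or_duke = sample[sample["College"].isin(["Texas", "Duke"])]
print(texas_under_30["Name"].tolist())
print(texas_or_duke["Name"].tolist())
```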
