#### Instructions:
##### Running a coding exercise for the first time can take a while as everything loads. Be patient; it may take a few minutes.

##### When you see ____ in a coding exercise, replace it with what you assume to be the correct code. Run it and see if you obtain the desired output. Submit your code to validate if you were correct.

##### Make sure you remove the hash (#) symbol in the coding portions of this question. We have commented them so that the line won’t execute and you can test your code after each step.

##### Let’s take a look at a modified version of our basketball player dataset.

##### First, let’s check whether any values are missing and, if so, where.

#### Tasks:

##### Use .describe() or .info() to find if there are any values missing from the dataset.
##### Using some of the skills we learned in the previous course, find the number of rows that contain missing values. Save that total in an object named num_nan.
##### Hint: .any(axis=1) may come in handy here.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

##### Loading in the data

In [2]:
bball_df = pd.read_csv('bball_imp.csv')
bball_df = bball_df[(bball_df['position'] == 'G') | (bball_df['position'] == 'F')]

##### Define X and y

In [3]:
X = bball_df.loc[:, ['height', 'weight', 'salary']]
y = bball_df['position']

##### Split the dataset

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)

##### Explore the missing data in the training features

In [16]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 245 entries, 381 to 251
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   height  208 non-null    float64
 1   weight  233 non-null    float64
 2   salary  238 non-null    float64
dtypes: float64(3)
memory usage: 7.7 KB


In [6]:
X_train.isnull().sum()

height    37
weight    12
salary     7
dtype: int64

##### Calculate the number of examples with missing values

In [8]:
num_nan = X_train.isnull().any(axis=1).sum()
num_nan

56
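
##### In the output above, the per-column counts (37 + 12 + 7 = 56) add up to num_nan, which suggests that no training row is missing more than one value. The two numbers differ as soon as a row has several missing cells. The sketch below illustrates this on a hypothetical toy dataframe (`toy` and its values are made up for illustration and are not part of the basketball data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the last row is missing both values.
toy = pd.DataFrame({
    "height": [1.98, np.nan, 2.03, np.nan],
    "weight": [95.0, 88.0, np.nan, np.nan],
})

# Count of missing cells in each column.
missing_per_column = toy.isnull().sum()

# Count of rows that contain at least one missing cell.
rows_with_nan = toy.isnull().any(axis=1).sum()

print(missing_per_column.sum())  # total number of missing cells
print(rows_with_nan)             # number of rows with any missing cell
```

##### Here the total number of missing cells is 4, but only 3 rows are affected, because .any(axis=1) counts each row once no matter how many of its cells are missing.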

##### Now that we’ve identified the columns with missing values, let’s use SimpleImputer to replace the missing values.

#### Tasks:

##### Import the necessary library.
##### Using SimpleImputer, replace the null values in the training and test sets with the median value of each column.
##### Save your transformed data in objects named X_train_imp and X_test_imp respectively.
##### Transform X_train_imp into a dataframe using the column and index labels from X_train and save it as X_train_imp_df.
##### Check if X_train_imp_df still has missing values.

In [9]:
from sklearn.impute import SimpleImputer

##### Fill in the missing values using imputation

In [10]:
imputer = SimpleImputer(strategy='median')
imputer.fit(X_train);
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)
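
##### To make the fit/transform split concrete, here is a minimal sketch on hypothetical toy data (`toy_train`, `toy_test`, and `toy_imputer` are illustrative names, not part of the exercise). It shows that the medians are learned from the training data only and then reused to fill the gaps in the test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not the basketball dataset.
toy_train = pd.DataFrame({
    "height": [1.80, 2.00, np.nan, 2.10],
    "weight": [80.0, np.nan, 90.0, 100.0],
})
toy_test = pd.DataFrame({"height": [np.nan], "weight": [85.0]})

# Learn one median per column from the training data only.
toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(toy_train)
print(toy_imputer.statistics_)  # per-column medians learned during fit

# The missing test height is filled with the training median, not a test statistic.
toy_test_imp = toy_imputer.transform(toy_test)
print(toy_test_imp)
```

##### Fitting on the training set alone keeps information from the test set out of the preprocessing step.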

##### Transform X_train_imp into a dataframe using the column and index labels from X_train

In [14]:
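columns = list(X_train.columns)
# Note: no index is passed here, so the result gets a default RangeIndex.
# Pass index=X_train.index as well to keep the row labels from X_train.
X_train_imp_df = pd.DataFrame(X_train_imp, columns=columns)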
X_train_imp_df

     height  weight      salary
0      1.98   115.7    898310.0
1      1.97    82.1  19500000.0
2      1.98    86.6   2625717.0
3      1.83    79.4   1845301.0
4      1.96    84.4   4466723.0
..      ...     ...         ...
240    2.03   111.1   1897800.0
241    1.96    81.6   3952920.0
242    1.97    88.5  38199000.0
243    1.98    91.2   1882867.0


##### Check if your training set still has missing values 

In [15]:
X_train_imp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 245 entries, 0 to 244
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   height  245 non-null    float64
 1   weight  245 non-null    float64
 2   salary  245 non-null    float64
dtypes: float64(3)
memory usage: 5.9 KB
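
##### The RangeIndex in the output above shows that the original row labels from X_train were not carried over. A minimal sketch of one way to keep them, assuming hypothetical toy data (`toy_train` and its index labels are made up for illustration), is to pass both the columns and the index when wrapping the imputed array:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with non-sequential row labels, as after a shuffled split.
toy_train = pd.DataFrame(
    {"height": [1.80, np.nan, 2.10], "weight": [80.0, 90.0, np.nan]},
    index=[381, 12, 251],
)

# fit_transform returns a plain NumPy array, so the labels are lost at this point.
toy_imp = SimpleImputer(strategy="median").fit_transform(toy_train)

# Restore both column names and row labels from the original dataframe.
toy_imp_df = pd.DataFrame(
    toy_imp, columns=toy_train.columns, index=toy_train.index
)

# True if any cell is still missing after imputation.
has_missing = toy_imp_df.isnull().any().any()
print(toy_imp_df.index.tolist())
print(has_missing)
```

##### Keeping the index makes it possible to line the imputed rows up with the matching entries of y_train by label.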
