# Advanced Python 
##   **Data Wrangling**

----

 ### **Topics**
**1. Feature Transformation of Datasets**
 * Renaming Axes (Rows and Columns) and Column Names
 * Dealing with Missing Values (NaN)

## 1. Feature Transformation of Datasets

In [2]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns

In [3]:
# import dataset
data = sns.load_dataset('titanic')
data.head(10)  # show the first 10 rows

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False


### **Renaming Axes (Rows and Columns) and Column Names**

In [4]:
# How to rename the axes
# Name the column axis (axis=1)
data = data.rename_axis("COLUMNS", axis="columns")
# Name the row axis (axis=0, the index)
data = data.rename_axis("ROWS", axis="index")
data.head()

COLUMNS,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
ROWS,,,,,,,,,,,,,,,
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
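
The difference between naming an axis and renaming its labels is easy to miss, so here is a minimal sketch on a small made-up frame (the frame `df` and its values are assumptions for illustration, not the Titanic data). It suggests that `rename_axis` only sets the name of the index or of the columns axis and leaves the labels themselves alone.

```python
import pandas as pd

# Small hypothetical frame used only for illustration
df = pd.DataFrame({"sex": ["male", "female"], "age": [22.0, 38.0]})

# rename_axis sets the *name* of each axis; the labels are untouched
named = (
    df.rename_axis("ROWS", axis="index")
      .rename_axis("COLUMNS", axis="columns")
)

print(named.index.name)     # name of the row axis
print(named.columns.name)   # name of the column axis
print(list(named.columns))  # column labels are still the original ones
```

Because `rename_axis` returns a new object, the original `df` keeps its unnamed axes unless the result is assigned back.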


In [5]:
# Rename column names
data = data.rename(columns={"sex": "gender"})  # rename a single column
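
The same `rename` call can also handle several columns at once. The sketch below uses a small made-up frame, and the new names (`siblings_spouses`, `parents_children`) are hypothetical choices for illustration only.

```python
import pandas as pd

# Small hypothetical frame standing in for the Titanic data
df = pd.DataFrame({"sex": ["male", "female"], "sibsp": [1, 0], "parch": [0, 2]})

# Rename several columns in one call by passing a larger mapping
new_names = {
    "sex": "gender",
    "sibsp": "siblings_spouses",   # hypothetical, more readable name
    "parch": "parents_children",   # hypothetical, more readable name
}
df = df.rename(columns=new_names)

# A callable is applied to every column label
df_upper = df.rename(columns=str.upper)

print(list(df.columns))
print(list(df_upper.columns))
```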

### **Dealing with Missing Values (NaN, zero, blank cells)**

> #### Steps
- Step 01 - Try to collect the data again (if possible)
- Step 02 - Remove the column with missing values, if dropping the complete variable will not affect your analysis
- Step 03 - Replace the missing values
   * HOW?
      1. - Fill the missing cells with the average value of the entire variable
      2. - Replace based on other functions (data sampler, etc.)
      3. - An ML algorithm can also be used (to predict the unknown value)
      4. - Leave it as it is (NaN, zero, empty, blank)
   * WHY?
      1. - It is better because no data is lost
      2. - It is less accurate, because the filled-in values are estimates
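
The steps above can be sketched as one small workflow. This is only a rough illustration on a made-up frame: the 50% cut-off (`DROP_THRESHOLD`) is an assumed value, not a rule from this notebook, and filling with the mean is just one of the options listed.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with missing values
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Share of missing values per column, in percent
missing_pct = df.isnull().mean() * 100

# Step 02: drop columns that are mostly empty (assumed cut-off)
DROP_THRESHOLD = 50.0
cols_to_drop = missing_pct[missing_pct > DROP_THRESHOLD].index
cleaned = df.drop(columns=cols_to_drop)

# Step 03: fill the remaining numeric gaps with each column's mean
numeric_cols = cleaned.select_dtypes(include="number").columns
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].mean())

print(cleaned)
```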

In [6]:
# check the shape of the dataset
rows, columns = data.shape  # get the number of rows and columns
print("Number of rows: ", rows)
print("Number of columns: ", columns)

Number of rows:  891
Number of columns:  15


In [7]:
# dataset info
data.info()
# there are 891 rows and 15 columns; age, embarked, deck and embark_town have missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   gender       891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [8]:
# let's check the missing values as a percentage
data.isnull().sum() / len(data) * 100
# more than 77% of the values in the deck column are missing, so we will drop this column because it carries too little information to be useful

COLUMNS
survived        0.000000
pclass          0.000000
gender          0.000000
age            19.865320
sibsp           0.000000
parch           0.000000
fare            0.000000
embarked        0.224467
class           0.000000
who             0.000000
adult_male      0.000000
deck           77.216611
embark_town     0.224467
alive           0.000000
alone           0.000000
dtype: float64
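
It can help to see the counts and the percentages side by side. The following is a minimal sketch on a made-up frame; the column names `missing_count` and `missing_pct` are hypothetical labels chosen for this example.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with missing values
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Combine the count and the percentage of NaN values into one table
summary = pd.DataFrame({
    "missing_count": df.isnull().sum(),
    "missing_pct": df.isnull().mean() * 100,
})

# Keep only columns that have gaps, worst first
summary = summary[summary["missing_count"] > 0].sort_values(
    "missing_pct", ascending=False
)
print(summary)
```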

#### Step 02 - Remove the column with missing values, if dropping the complete variable will not affect your analysis

In [9]:
# Drop method on a particular column
data.drop(['deck'], axis=1, inplace=True)
data.head()


COLUMNS,survived,pclass,gender,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
ROWS,,,,,,,,,,,,,,
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True
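
As an alternative to `inplace=True`, `drop` can return a new frame and leave the original untouched. The sketch below uses a small made-up frame. Passing `errors="ignore"` is an optional extra shown here so that re-running the cell does not fail once the column is already gone.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with a mostly empty column
df = pd.DataFrame({
    "age": [22.0, 38.0],
    "deck": [np.nan, "C"],
    "fare": [7.25, 71.28],
})

# drop returns a new frame; df itself still has the deck column
trimmed = df.drop(columns=["deck"])

# errors="ignore" skips labels that are not present instead of raising
trimmed_again = trimmed.drop(columns=["deck"], errors="ignore")

print(list(df.columns))
print(list(trimmed.columns))
```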


In [10]:
# check the missing values in the dataset
data.isnull().sum()
# there are 177 missing values in the age column and 2 each in the embarked and embark_town columns
# we can drop the rows containing these NaN values, or fill them with some value (like the mean of that column)


COLUMNS
survived         0
pclass           0
gender           0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
embark_town      2
alive            0
alone            0
dtype: int64

#### Step 03 - Replace the missing values

In [None]:
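# Replacing method 01: filling NaN values with text information
# fillna returns a new Series in which every NaN is replaced by the given value
# the result is kept in a separate variable so that data["age"] stays numeric for the mean computed later
age_as_text = data["age"].astype("object").fillna("Age not defined")  # filled with the string "Age not defined"
age_as_text.tail(10)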

In [None]:
# Replacing method 02: replace returns a new DataFrame with every NaN swapped for the given value
# the result is not assigned back, so data itself is unchanged
data.replace(to_replace=np.nan, value=0)  # replace with 0
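
A single fill value rarely suits every column. As a sketch, `fillna` also accepts a dictionary that maps column names to fill values. The frame and the chosen values (`0` and `"Unknown"`) below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with gaps in several columns
df = pd.DataFrame({
    "age": [22.0, np.nan],
    "embarked": ["S", np.nan],
    "fare": [7.25, np.nan],
})

# One fill value per column; columns not listed are left as they are
fill_values = {"age": 0, "embarked": "Unknown"}
filled = df.fillna(value=fill_values)

print(filled)
```

Here `fare` keeps its NaN because it is not in the mapping, and `df` is unchanged because the result is stored in a new variable.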

In [23]:
# Replacing method 03: fill NaN with the average of that column
# find the mean of the age column
mean_age = data["age"].mean()
print(mean_age)

# replace NaN in age with the mean age
data["age"] = data["age"].replace(np.nan, mean_age)


data.isnull().sum()  # check the missing values in the dataset after updating

29.69911764705882


COLUMNS
survived       0
pclass         0
gender         0
age            0
sibsp          0
parch          0
fare           0
embarked       2
class          0
who            0
adult_male     0
embark_town    2
alive          0
alone          0
dtype: int64
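
The overall mean is only one possible statistic. As a rough sketch on a made-up frame, the median or a per-group mean can be used in the same way. Grouping by `pclass` is an assumption made for this example, not something this notebook does.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame: two passenger classes with gaps in age
df = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [38.0, np.nan, 54.0, 22.0, 26.0, np.nan],
})

# Option A: fill with the overall median, which is less sensitive to outliers
overall_median = df["age"].median()
df["age_median"] = df["age"].fillna(overall_median)

# Option B: fill with the mean age of the passenger's own class
group_mean = df.groupby("pclass")["age"].transform("mean")
df["age_by_class"] = df["age"].fillna(group_mean)

print(df)
```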

In [24]:
# now only the embarked and embark_town columns have missing values (2 rows), so we can drop these rows
data.dropna(inplace=True)  # drop the rows with NaN values
data.isnull().sum()  # check the missing values in the dataset after updating

COLUMNS
survived       0
pclass         0
gender         0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

In [25]:
data.shape # check the shape of the dataset after updating

(889, 14)
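
Calling `dropna()` without arguments removes every row that has a NaN in any column. As a minimal sketch on a made-up frame, the `subset` parameter limits the check to chosen columns, so rows are dropped only when those columns are empty.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with gaps in two different columns
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "embarked": ["S", "C", np.nan],
})

# Drops rows 1 and 2: each has a NaN somewhere
dropped_any = df.dropna()

# Drops only row 2: the check looks at embarked alone
dropped_subset = df.dropna(subset=["embarked"])

print(len(dropped_any), len(dropped_subset))
```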

#  **Finally, the data is ready for analysis and visualization**

----------- 

### Other methods to replace NaN values in datasets

In [None]:
# import libraries
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
# import dataset
data = sns.load_dataset('titanic')
data.head(10)  # show the first 10 rows

#### Replacing missing values in different ways

In [None]:
# Method 01 - Interpolate missing values
# interpolate fills each NaN from its neighbouring rows; method='linear' draws a straight line between the nearest valid values above and below
# limit_direction='forward' fills gaps moving down the column, so a NaN before the first valid value stays NaN

data["age"] = data["age"].interpolate(method='linear', limit_direction='forward', axis=0)
data.head()
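
To make the behaviour of linear interpolation easier to see, here is a small sketch on a made-up Series (the values are assumptions chosen so the result is easy to check by hand).

```python
import numpy as np
import pandas as pd

# Hypothetical series: one leading NaN and a two-value gap in the middle
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0])

# The gap between 10 and 40 is filled in equal steps;
# the leading NaN has no earlier value to start from, so it stays NaN
filled = s.interpolate(method="linear", limit_direction="forward")

print(filled)
```

Note that interpolation assumes neighbouring rows are related. For a table where row order carries no meaning, the filled values are only as good as that assumption.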

In [None]:
# Method 02 - SimpleImputer from the sklearn library
# SimpleImputer fills missing values with a statistic of the column (here, the mean)
# import libraries
from sklearn.impute import SimpleImputer

# select the column to impute as a 2D frame, as sklearn expects
X = data[["age"]]

# create the imputer object and fit it on the column
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean').fit(X)
# transform returns a 2D array; ravel() flattens it, and astype("int64") truncates ages to whole years
data["age"] = imp_mean.transform(X).ravel().astype("int64")
data.head()
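
`SimpleImputer` can also work on several numeric columns at once and with other strategies. The following is a minimal sketch on a made-up frame; the choice of `strategy="median"` and of the columns `age` and `fare` is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Small hypothetical frame with gaps in two numeric columns
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "fare": [7.25, 71.28, np.nan, 53.10],
})

num_cols = ["age", "fare"]

# One imputer learns a separate median for each column
imputer = SimpleImputer(strategy="median")
df[num_cols] = imputer.fit_transform(df[num_cols])

print(imputer.statistics_)  # the per-column medians that were learned
print(df)
```

Keeping the fitted `imputer` object around makes it possible to apply the same learned values to new data later with `imputer.transform`.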