<a href="https://colab.research.google.com/github/Highashikata/The-Data-King-Tips/blob/main/The_Data_King_Tips.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **The Data King - Tips & Tricks**

In this notebook, we go through data science tips and tricks that help you write code faster and run a cleaner data analysis process.

PS: This notebook is inspired by Kevin Markham, the founder of Data School.

### **Sklearn Tips**

#### **Tip 1 : Use ColumnTransformer to apply different preprocessing to different columns** ☁

In [1]:
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=6)

In [2]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [12]:
cols = ['Fare', 'Embarked', 'Sex', 'Age']
X = df[cols]
X

| | Fare | Embarked | Sex | Age |
|---|---|---|---|---|
| 0 | 7.25 | S | male | 22.0 |
| 1 | 71.2833 | C | female | 38.0 |
| 2 | 7.925 | S | female | 26.0 |
| 3 | 53.1 | S | female | 35.0 |
| 4 | 8.05 | S | male | 35.0 |
| 5 | 8.4583 | Q | male | NaN |


In [11]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer

In [13]:
ohe = OneHotEncoder()
imp = SimpleImputer()

In [14]:
# One-hot encode the categorical columns, impute missing ages,
# and pass the remaining column (Fare) through unchanged.
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (imp, ['Age']),
    remainder='passthrough'
)

In [16]:
ct.fit_transform(X)

array([[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    , 22.    ,  7.25  ],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    , 38.    , 71.2833],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    , 26.    ,  7.925 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    , 35.    , 53.1   ],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    , 35.    ,  8.05  ],
       [ 0.    ,  1.    ,  0.    ,  0.    ,  1.    , 31.2   ,  8.4583]])

#### **Tip 2 : Select the columns for a ColumnTransformer in several equivalent ways**

In [24]:
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector

In [33]:
# Each line builds an equivalent transformer for X; only the column selection differs.
ct = make_column_transformer((ohe, ['Embarked', 'Sex']))  # column names
ct = make_column_transformer((ohe, [1, 2]))  # integer positions
ct = make_column_transformer((ohe, slice(1, 3)))  # slice of positions
ct = make_column_transformer((ohe, [False, True, True, False]))  # boolean mask
ct = make_column_transformer((ohe, make_column_selector(pattern='E|S')))  # regex on column names
ct = make_column_transformer((ohe, make_column_selector(dtype_include=object)))  # include a dtype
ct = make_column_transformer((ohe, make_column_selector(dtype_exclude='number')))  # exclude a dtype
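To check what a selector actually picks, one option is to call it directly on the DataFrame: a `make_column_selector` object is callable and returns the matching column names. The sketch below uses a small hand-made frame (`X_demo` is an illustrative name, not part of the notebook) with the same four columns as `X`.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Same column layout as X: Fare, Embarked, Sex, Age.
X_demo = pd.DataFrame({
    'Fare': [7.25, 71.2833],
    'Embarked': ['S', 'C'],
    'Sex': ['male', 'female'],
    'Age': [22.0, 38.0],
})

# Regex search on the column names: matches a capital E or a capital S.
by_pattern = make_column_selector(pattern='E|S')(X_demo)

# Everything that is not numeric.
by_dtype = make_column_selector(dtype_exclude='number')(X_demo)

print(by_pattern)
print(by_dtype)
```

Both calls should return the `Embarked` and `Sex` columns here, which is why the selectors are interchangeable for this particular frame.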

#### **Tip 3 : A quick tip about fit and transform**




 - `fit`: the transformer learns something from the data (for example, the mean of a column or the set of categories).
 - `transform`: the transformer applies what it learned to the data.
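As a minimal sketch of the two steps, the example below runs `SimpleImputer` on the six ages from Tip 1. The variable names (`ages`, `imp_demo`) are illustrative. After `fit`, the learned mean is stored in the `statistics_` attribute; `transform` then uses that stored value to fill the missing entry.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Ages from Tip 1; the last one is missing.
ages = np.array([[22.0], [38.0], [26.0], [35.0], [35.0], [np.nan]])

imp_demo = SimpleImputer()  # default strategy is the column mean

# fit: learn the mean of the non-missing values.
imp_demo.fit(ages)
learned_mean = imp_demo.statistics_[0]

# transform: apply the learned mean to the missing entry.
filled = imp_demo.transform(ages)

print(learned_mean)
print(filled.ravel())
```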

#### **Tip 4 : When to use fit_transform and when to use transform**

- Use `fit_transform` on the training data, for example the training split returned by `train_test_split`.
- Use `transform` only on the test data and on new data, so that they are transformed with what was learned from the training data.
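A small sketch of this pattern, assuming a toy numeric array and `StandardScaler` as the transformer (both chosen only for illustration): the scaler learns its mean and standard deviation from the training split and reuses them on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data: 10 rows, 2 numeric columns.
X_all = np.arange(20, dtype=float).reshape(10, 2)
X_train, X_test = train_test_split(X_all, test_size=0.3, random_state=0)

scaler = StandardScaler()

# Training data: learn the statistics and apply them in one call.
X_train_scaled = scaler.fit_transform(X_train)

# Test data: only apply the statistics learned from the training data.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)
print(X_test_scaled)
```

Calling `fit_transform` on the test split instead would recompute the statistics from the test data, so the two splits would no longer be on the same scale.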




#### **Tip 5 : Encoding categorical data (unordered vs. ordered)**



- Ordinal data: ordered categories. Use `OrdinalEncoder`.
- Nominal data: unordered categories. Use `OneHotEncoder`.

In [34]:
import pandas as pd
X = pd.DataFrame({'Shape':['square', 'square', 'oval', 'circle'],
                  'Class': ['third', 'first', 'second', 'third'],
                  'Size': ['S', 'S', 'L', 'XL']})

In [35]:
X

| | Shape | Class | Size |
|---|---|---|---|
| 0 | square | third | S |
| 1 | square | first | S |
| 2 | oval | second | L |
| 3 | circle | third | XL |


In [36]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

In [37]:
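# `sparse_output` replaced the `sparse` argument in scikit-learn 1.2; on older versions use sparse=False.
ohe = OneHotEncoder(sparse_output=False)
ohe.fit_transform(X[['Shape']])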

array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])

In [42]:
oe = OrdinalEncoder(categories = [['first', 'second', 'third'], ['S', 'M', 'L', 'XL']])
oe.fit_transform(X[['Class', 'Size']])

array([[2., 0.],
       [0., 0.],
       [1., 2.],
       [2., 3.]])
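The `categories` argument is what gives the codes their meaning. As a sketch (variable names are illustrative), the example below encodes the `Size` column twice: once with the default behavior, which sorts the labels, and once with the explicit order used above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'Size': ['S', 'S', 'L', 'XL']})

# No `categories`: the labels are sorted, so the implied order is L < S < XL.
default_codes = OrdinalEncoder().fit_transform(sizes).ravel().tolist()

# Explicit order, as in the cell above: S < M < L < XL.
ordered = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])
ordered_codes = ordered.fit_transform(sizes).ravel().tolist()

# inverse_transform maps codes back to the original labels.
restored = ordered.inverse_transform([[0.0], [2.0], [3.0]]).ravel().tolist()

print(default_codes)
print(ordered_codes)
print(restored)
```

With the default ordering, `L` gets a lower code than `S`, which does not match the real size order; passing `categories` avoids that.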