<a href="https://colab.research.google.com/github/Kalina95/MachineLearningCourse/blob/main/01_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing and importing libraries

When you're working with local files, you'll probably need to install the libraries on your own. Please use these commands:



```
!pip install scikit-learn
!pip install numpy
!pip install pandas
```



In [21]:
import numpy as np
import pandas as pd
import sklearn
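
Library APIs change between releases, so it can help to note which versions are installed before going further. A minimal sketch; the `versions` dictionary is an illustrative name, not part of the original notebook:

```python
import numpy as np
import pandas as pd
import sklearn

# Collect the installed version strings (the name `versions` is illustrative).
versions = {
    "numpy": np.__version__,
    "pandas": pd.__version__,
    "scikit-learn": sklearn.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```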

# Data Generation

## DataFrame Preparation


1. Create the `data` object as a dictionary.
2. Create a pandas dataframe based on this object.
3. Copy the dataframe to another object to keep a "backup".


> `df.info()` gives us information about the columns in the dataframe. It makes it easy to see whether the data set contains any null values.
This command also shows the data type of each column. It's good practice to convert the `object` data type to a categorical data type, which is much more effective for visualization.

> `df.copy()` creates a copy of the dataframe.
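
To illustrate why the copy works as a backup, here is a small sketch. The two-column frame and the names `raw`, `working` and `null_count` are assumptions made for this example only:

```python
import pandas as pd

# Hypothetical two-row frame, used only for illustration.
raw = pd.DataFrame({"size": ["XL", "L"], "price": [199.0, 89.0]})
working = raw.copy()

# Changing the working copy leaves the original frame untouched.
working.loc[0, "price"] = 0.0
print(raw.loc[0, "price"])      # 199.0
print(working.loc[0, "price"])  # 0.0

# Total number of null values, the same information df.info() hints at.
null_count = int(working.isnull().sum().sum())
print(null_count)               # 0
```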

In [22]:
data = {
    'size': ['XL', 'L', 'M', 'L', 'M'],
    'color': ['red', 'green', 'blue', 'green', 'red'],
    'gender': ['female', 'male', 'male', 'female', 'female'],
    'price': [199.0, 89.0, 99.0, 129.0, 79.0],
    'weight': [500, 450, 300, 380, 410],
    'bought': ['yes', 'no', 'yes', 'no', 'yes']
}

df_raw = pd.DataFrame(data=data)

df = df_raw.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   size    5 non-null      object 
 1   color   5 non-null      object 
 2   gender  5 non-null      object 
 3   price   5 non-null      float64
 4   weight  5 non-null      int64  
 5   bought  5 non-null      object 
dtypes: float64(1), int64(1), object(4)
memory usage: 368.0+ bytes


## Data Preprocessing - Label



We'd like to transform the `object` columns to a categorical data type, with the label eventually encoded as `true/false` or `1/0`.

In [23]:
for column in ['size', 'color', 'gender', 'bought']:
  df[column] = df[column].astype('category')

df['weight'] = df['weight'].astype('float')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   size    5 non-null      category
 1   color   5 non-null      category
 2   gender  5 non-null      category
 3   price   5 non-null      float64 
 4   weight  5 non-null      float64 
 5   bought  5 non-null      category
dtypes: category(4), float64(2)
memory usage: 740.0 bytes
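
A short sketch of what the `category` dtype stores internally: each value becomes an integer code that points into a list of categories. The ordered variant is an optional extra, and the `M < L < XL` order is an assumption about what the sizes mean:

```python
import pandas as pd

sizes = pd.Series(["XL", "L", "M", "L", "M"])

# Unordered categorical: categories are sorted lexically by default.
as_category = sizes.astype("category")
print(list(as_category.cat.categories))  # ['L', 'M', 'XL']
print(as_category.cat.codes.tolist())    # [2, 0, 1, 0, 1]

# Ordered categorical; the M < L < XL order is an assumption about the data.
size_dtype = pd.CategoricalDtype(categories=["M", "L", "XL"], ordered=True)
ordered = sizes.astype(size_dtype)
print(ordered.cat.codes.tolist())        # [2, 1, 0, 1, 0]
print(ordered.max())                     # XL
```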


We can use `df.describe()` to see the core statistics for our frame. By default, however, it does not include the categorical columns.

To transpose the resulting dataframe, use the `.T` attribute.

In [24]:
df.describe()

| | price | weight |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 119.0 | 408.0 |
| std | 48.476799 | 75.299402 |
| min | 79.0 | 300.0 |
| 25% | 89.0 | 380.0 |
| 50% | 99.0 | 410.0 |
| 75% | 129.0 | 450.0 |
| max | 199.0 | 500.0 |


In [25]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| price | 5.0 | 119.0 | 48.476799 | 79.0 | 89.0 | 99.0 | 129.0 | 199.0 |
| weight | 5.0 | 408.0 | 75.299402 | 300.0 | 380.0 | 410.0 | 450.0 | 500.0 |


In [26]:
df.describe(include=['category']).T

| | count | unique | top | freq |
|---|---|---|---|---|
| size | 5 | 3 | L | 2 |
| color | 5 | 3 | green | 2 |
| gender | 5 | 2 | female | 3 |
| bought | 5 | 2 | yes | 3 |


In [27]:
df

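| | size | color | gender | price | weight | bought |
|---|---|---|---|---|---|---|
| 0 | XL | red | female | 199.0 | 500.0 | yes |
| 1 | L | green | male | 89.0 | 450.0 | no |
| 2 | M | blue | male | 99.0 | 300.0 | yes |
| 3 | L | green | female | 129.0 | 380.0 | no |
| 4 | M | red | female | 79.0 | 410.0 | yes |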


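Mapping the label: convert the `yes`/`no` values to `1`/`0`.

Calling `labelEncoder.fit(df['bought'])` followed by `labelEncoder.transform(df['bought'])` is equivalent to a single call to `labelEncoder.fit_transform(df['bought'])`.

`labelEncoder.inverse_transform(df['bought'])` reverses `fit_transform`, mapping the encoded values back to the original labels.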


In [29]:
from sklearn.preprocessing import LabelEncoder

labelEncoder = LabelEncoder()
#labelEncoder.fit(df['bought'])
#labelEncoder.transform(df['bought'])
#labelEncoder.fit_transform(df['bought'])

df['bought'] = labelEncoder.fit_transform(df['bought'])
#labelEncoder.inverse_transform(df['bought']) # reversed fit_transform
df


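| | size | color | gender | price | weight | bought |
|---|---|---|---|---|---|---|
| 0 | XL | red | female | 199.0 | 500.0 | 1 |
| 1 | L | green | male | 89.0 | 450.0 | 0 |
| 2 | M | blue | male | 99.0 | 300.0 | 1 |
| 3 | L | green | female | 129.0 | 380.0 | 0 |
| 4 | M | red | female | 79.0 | 410.0 | 1 |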
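
The round trip between labels and integer codes can be sketched on a plain list. The names `labels`, `encoded`, `separate` and `decoded` are illustrative and not part of the notebook:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["yes", "no", "yes", "no", "yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# Classes are sorted, so 'no' maps to 0 and 'yes' maps to 1.
print(encoder.classes_.tolist())  # ['no', 'yes']
print(encoded.tolist())           # [1, 0, 1, 0, 1]

# fit() followed by transform() gives the same codes as fit_transform().
separate = LabelEncoder().fit(labels).transform(labels)
print(separate.tolist())          # [1, 0, 1, 0, 1]

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(encoded)
print(decoded.tolist())           # ['yes', 'no', 'yes', 'no', 'yes']
```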


## Data Preprocessing - Categories

In [31]:
from sklearn.preprocessing import OneHotEncoder

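# `sparse` was renamed to `sparse_output` in scikit-learn 1.2;
# newer releases raise a TypeError for the old keyword.
encoder = OneHotEncoder(sparse_output=False)
# OneHotEncoder expects 2D input, so select the column as a DataFrame.
encoder.fit(df[['size']])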
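
The `sparse` keyword was renamed to `sparse_output` in scikit-learn 1.2, so newer releases reject the old name. Below is a minimal sketch that avoids the keyword altogether by keeping the default sparse output and converting it with `.toarray()`. The standalone `sizes` frame is an assumption that mirrors the `size` column above:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Standalone frame mirroring the `size` column (an assumption for this sketch).
sizes = pd.DataFrame({"size": ["XL", "L", "M", "L", "M"]})

# The default output is a SciPy sparse matrix, so no `sparse` or
# `sparse_output` keyword is needed; toarray() makes it dense.
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(sizes[["size"]]).toarray()

# Wrap the result in a DataFrame with one column per category.
one_hot_df = pd.DataFrame(one_hot, columns=encoder.categories_[0])
print(one_hot_df)
```

Note that the encoder receives `sizes[["size"]]` (a 2D DataFrame) rather than `sizes["size"]` (a 1D Series), because `OneHotEncoder` expects two-dimensional input.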