# Data Preprocessing Tools

### Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Import data set with pandas

In [None]:
dataset = pd.read_csv('C:/Users/Acer/Desktop/PythonNotes/Python/DataSetsPython/data1.csv')
x = dataset.iloc[:,:-1].values   #independent variables
y = dataset.iloc[:,-1].values    #Dependent variable
dataset

The "iloc" indexer stands for integer location: it selects rows and columns by position. The ":" indicates a range, which includes its lower limit and excludes its upper limit. The "-1" refers to the last column, so "[:, :-1]" puts every column except the last into x, and "[:, -1]" puts only the last column into y.
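As a small illustration of this slicing, the sketch below builds a toy table and splits it the same way. The column names and values are made up for the example and are not taken from data1.csv.

```python
import pandas as pd

# Hypothetical toy table; names and values are assumptions for illustration only.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44, 27, 30],
    "Salary": [72000, 48000, 54000],
    "Purchased": ["No", "Yes", "No"],
})

features = toy.iloc[:, :-1].values  # all rows, every column except the last
target = toy.iloc[:, -1].values     # all rows, only the last column

print(features.shape)  # (3, 3)
print(target)          # ['No' 'Yes' 'No']
```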

### Solution for Missing Data

In [None]:
missing_values = dataset.isnull().sum()
print(missing_values) #Number of missing values in each column

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean') #object that replaces each missing value with its column mean
imputer.fit(x[:,1:3])    #"fit" computes the mean of each specified column
x[:,1:3]=imputer.transform(x[:,1:3]) #"transform" fills the missing values; we write the result back into x

Instead of deleting the rows that contain missing values, we use an imputer that replaces each missing value with the mean of its column.
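To see what the imputer does, here is a minimal sketch on a made-up array with one missing entry per column. The numbers are assumptions chosen so the column means are easy to check by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric columns (think age and salary) with one missing entry each.
values = np.array([
    [40.0, 60000.0],
    [np.nan, 50000.0],
    [30.0, np.nan],
    [50.0, 70000.0],
])

toy_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = toy_imputer.fit_transform(values)  # fit learns the means, transform fills the gaps

print(toy_imputer.statistics_)  # column means learned by fit: 40 and 60000
print(filled)                   # the two NaN entries are replaced by those means
```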

### Categorical Data

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[0])],
                       remainder='passthrough')
x = np.array(ct.fit_transform(x))

Independent variable: we one-hot encode the "country" column. Its 3 countries are transformed into 3 new columns, and each column takes the value 1 for its respective country and 0 otherwise. 
Useful for directly transforming a categorical feature that has more than two categories. 
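The sketch below repeats the one-hot encoding on a made-up array, so the new columns can be read off directly. The country names and ages are assumptions for illustration. The encoder orders the categories alphabetically, which decides which new column belongs to which country.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up feature matrix: a country column followed by one numeric column.
toy_x = np.array([
    ["France", 44],
    ["Spain", 27],
    ["Germany", 30],
    ["Spain", 38],
], dtype=object)

ct_toy = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],  # encode column 0 only
    remainder="passthrough",                           # keep the other columns as they are
)
encoded = np.array(ct_toy.fit_transform(toy_x))

# Categories are sorted, so the dummy columns are France, Germany, Spain.
print(ct_toy.named_transformers_["encoder"].categories_[0])
print(encoded)  # the dummy columns come first, then the passthrough column
```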

In [None]:
print(x)

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y=le.fit_transform(y)

Dependent variable: we replaced the "yes/no" column with a "1/0" column. 
Useful for directly encoding a binary outcome from a two-class label.
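A minimal sketch of the same encoding on a made-up list of labels. LabelEncoder sorts the classes, so "No" becomes 0 and "Yes" becomes 1, and the mapping can be reversed with inverse_transform.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up two-class labels for illustration.
labels = ["No", "Yes", "No", "Yes", "Yes"]

le_toy = LabelEncoder()
encoded_labels = le_toy.fit_transform(labels)

print(le_toy.classes_)                            # ['No' 'Yes']
print(encoded_labels)                             # [0 1 0 1 1]
print(le_toy.inverse_transform(encoded_labels))   # back to the original labels
```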

In [None]:
print(y)

Let's take another example

In [None]:
# Importing the necessary libraries
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
import pandas as pd 
import numpy as np 
# Load the dataset
titanic_dataset = pd.read_csv('C:/Users/Acer/Desktop/PythonNotes/Python/DataSetsPython/titanic.csv')

# Identify the categorical data
categorical_features = ['Sex', 'Embarked', 'Pclass']
# Implement an instance of the ColumnTransformer class
ct = ColumnTransformer( transformers=[ ('encoder', OneHotEncoder(), categorical_features) ],remainder='passthrough' )
# Apply fit_transform to the features only; 'Survived' is the target, so it is left out of X
X = ct.fit_transform(titanic_dataset.drop(columns=['Survived']))
# Convert the output into a NumPy array
X = np.array(X)
# Use LabelEncoder to encode binary categorical data
le = LabelEncoder()
Y = le.fit_transform(titanic_dataset['Survived'])
# Print the updated matrix of features and the dependent variable vector
print(X)


In [None]:
print(Y)

# Splitting Data

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
X_train, X_test, y_train, y_test 

In [None]:
print(X_train)

In [None]:
print(X_test)

In [None]:
print(y_train)

In [None]:
print(y_test)

# Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])

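We reused the scaler fitted on the training features to transform the test features. Fitting a new scaler on the test set would compute a different mean and standard deviation, so the test data would no longer be on the scale the model was trained on.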

Also, it is worth mentioning that we didn't apply feature scaling to the dependent variable because it only takes the values 0 and 1. For the same reason, the slice "3:" skips the three dummy columns created by the one-hot encoding.
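The reuse of the training scaler can be checked on made-up numbers. In this sketch the mean and standard deviation are learned from the training column only and then applied unchanged to the test column. All values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature training and test sets.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[20.0], [40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training data
test_scaled = scaler.transform(test)        # applies the training mean and std, no refit

print(scaler.mean_, scaler.scale_)  # statistics taken from the training set only
print(test_scaled.ravel())          # 20 maps to 0 because it equals the training mean
```

Because the test values are scaled with the training statistics, the scaled test column does not need to have mean 0 or standard deviation 1.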

In [None]:
print(X_train)

In [None]:
print(X_test)

Another example

In [None]:
wine_dataset = pd.read_csv('C:/Users/Acer/Desktop/PythonNotes/Python/DataSetsPython/winequality-red.csv') # watch out for delimiters: if the file is semicolon-separated, pass delimiter=';'
# Separate features and target
wx = wine_dataset.iloc[:,:-1].values   
wy = wine_dataset.iloc[:,-1].values   

# Split the dataset into an 80-20 training-test set
wX_train, wX_test, wy_train, wy_test = train_test_split(wx, wy, test_size=0.2, random_state=42)


# Create an instance of the StandardScaler class
sc = StandardScaler()

# Fit the StandardScaler on the features from the training set and transform it
wX_train = sc.fit_transform(wX_train)

# Apply the transform to the test set
wX_test = sc.transform(wX_test)

# Print the scaled training and test datasets
print(wX_train)
print(wX_test)