# Task 30: Some Preprocessing Using scikit-learn

Preprocessing is a crucial step in preparing data for machine learning models. With scikit-learn,
you can scale features with StandardScaler, encode categorical variables with OneHotEncoder, and impute
missing values with SimpleImputer. For example, numerical features can be standardized to zero mean
and unit variance, while categorical features can be converted into binary indicator columns. Once
preprocessing is complete, you can fit scikit-learn models such as Linear Regression, Logistic
Regression, and Random Forest. To optimize these models, tune their hyperparameters with Grid Search
or Random Search. Apply the models to your dataset and record the results for each technique so that
their performance can be compared.

### Importing necessary libraries and dataset

In [22]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
from scipy.stats import uniform, randint



file_path = r'C:\Users\Huawei\Desktop\Titanic-Dataset.csv'
data = pd.read_csv(file_path)

### Features and target variable

In [4]:
features = data.drop(columns=['Survived'])
target = data['Survived']

### Numerical and categorical columns

In [7]:
numerical_cols = features.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = features.select_dtypes(include=['object']).columns

### Separate numerical and categorical features

In [8]:
X_numerical = features[numerical_cols]
X_categorical = features[categorical_cols]

### Initialize the imputer and scaler for numerical features

In [9]:
imputer_num = SimpleImputer(strategy='mean')  # fill missing values with the mean of each column
scaler = StandardScaler()

### Fit and transform the numerical data

In [10]:
X_numerical_imputed = imputer_num.fit_transform(X_numerical)
X_numerical_scaled = scaler.fit_transform(X_numerical_imputed)
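
As a small illustration of what these two steps do, the sketch below applies mean imputation and standardization to a toy array. The values in `X_toy` are made up for the example and are not taken from the Titanic data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy numerical data (hypothetical values) with one missing entry per column
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [np.nan, 30.0],
                  [3.0, 20.0]])

# Each NaN is replaced by the mean of the observed values in its column
imputed = SimpleImputer(strategy='mean').fit_transform(X_toy)

# Each column is then shifted and rescaled to zero mean and unit variance
scaled = StandardScaler().fit_transform(imputed)

print(imputed)
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

Note that StandardScaler uses the population standard deviation, which is also what `np.std` computes by default.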

### Initialize the imputer and encoder for categorical features

In [11]:
imputer_cat = SimpleImputer(strategy='most_frequent')  # fill missing values with the mode (most frequent value) of each column
# handle_unknown='ignore': categories not seen during fit are encoded as all zeros instead of raising an error
# sparse_output=False: return a regular dense array rather than a sparse matrix that stores only non-zero values
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

### Fit and transform the categorical data

In [12]:
X_categorical_imputed = imputer_cat.fit_transform(X_categorical)
X_categorical_encoded = encoder.fit_transform(X_categorical_imputed)
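
To make the effect of `most_frequent` imputation and `handle_unknown='ignore'` concrete, here is a minimal sketch on a toy column. The category values are hypothetical, and the encoder output is converted with `.toarray()` so the example does not depend on the `sparse_output` argument.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (hypothetical values) with one missing entry
train_col = np.array([['S'], ['C'], [np.nan], ['S']], dtype=object)

# The missing entry is replaced by the most frequent category, 'S'
imputed = SimpleImputer(strategy='most_frequent').fit_transform(train_col)

encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(imputed).toarray()

# A category that was not present during fit becomes a row of zeros
unseen = encoder.transform(np.array([['Q']], dtype=object)).toarray()

print(encoder.categories_[0])
print(encoded)
print(unseen)
```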

### Combine numerical and categorical features

In [13]:
X_preprocessed = np.hstack((X_numerical_scaled, X_categorical_encoded))

### Split data into training and test sets

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, target, test_size=0.2, random_state=42)
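
In the cells above, the imputers, scaler, and encoder are fitted on the full dataset before the split, so statistics from the test rows influence the preprocessing. One common alternative, sketched here under assumptions, is to split first and wrap the steps in a `ColumnTransformer` that is fitted on the training rows only. The toy DataFrame and the names `toy`, `num_cols`, `cat_cols`, and `preprocess` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the Titanic data; values are made up
toy = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, 14.0],
    'Fare': [7.25, 71.28, 7.92, 53.1, 51.86, 21.07, 11.13, 30.07],
    'Embarked': pd.Series(['S', 'C', 'S', np.nan, 'S', 'Q', 'S', 'C'], dtype=object),
    'Survived': [0, 1, 1, 1, 0, 0, 1, 1],
})
X = toy.drop(columns=['Survived'])
y = toy['Survived']
num_cols = ['Age', 'Fare']
cat_cols = ['Embarked']

# Same steps as above, grouped per column type; sparse_threshold=0 forces dense output
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='mean')),
                      ('scale', StandardScaler())]), num_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]), cat_cols),
], sparse_threshold=0)

# Split first, then learn the preprocessing from the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
X_tr_prep = preprocess.fit_transform(X_tr)
X_te_prep = preprocess.transform(X_te)

print(X_tr_prep.shape, X_te_prep.shape)
```

The same `preprocess` object can also be placed in front of a model inside a `Pipeline`, so that a cross-validated search refits the preprocessing within each fold.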

### Initialize a RandomForest model for comparison

In [15]:
model_rf = RandomForestClassifier(random_state=42)

### Hyperparameter tuning for RandomForest using grid search

In [23]:
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30]
}
grid_search_rf = GridSearchCV(model_rf, param_grid_rf, cv=4, scoring='accuracy')
grid_search_rf.fit(X_train, y_train)
print("Best parameters from RandomForest Grid Search: ", grid_search_rf.best_params_)
print("Best accuracy from RandomForest Grid Search: ", grid_search_rf.best_score_*100)

Best parameters from RandomForest Grid Search:  {'max_depth': None, 'n_estimators': 50}
Best accuracy from RandomForest Grid Search:  75.0


### Hyperparameter tuning for RandomForest using random search

In [17]:
model_rf = RandomForestClassifier(random_state=42)
param_dist_rf = {
    'n_estimators': randint(50, 201), 
    'max_depth': [None, 10, 20, 30]
}
random_search_rf = RandomizedSearchCV(estimator=model_rf, param_distributions=param_dist_rf, n_iter=10, cv=4, scoring='accuracy', random_state=42, n_jobs=-1)
random_search_rf.fit(X_train, y_train)
print("Best parameters from RandomForest Random Search: ", random_search_rf.best_params_)
print("Best accuracy from RandomForest Random Search: ", random_search_rf.best_score_ * 100)

Best parameters from RandomForest Random Search:  {'max_depth': 20, 'n_estimators': 142}
Best accuracy from RandomForest Random Search:  87.5


### Hyperparameter tuning evaluation for the RandomForest model
- The RandomForest model from Random Search, with parameters {'max_depth': 20, 'n_estimators': 142}, achieved a cross-validated accuracy of 87.5%, compared with 75% for the Grid Search model.
- Random Search sampled an `n_estimators` value (142) that the grid [50, 100, 200] did not contain, which may explain the gap. Both figures are 4-fold cross-validation means on the training set, so the difference should be confirmed on the test set before concluding that one search method is better.
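
The scores above are cross-validation means from `best_score_`. A minimal sketch of how a tuned model could additionally be checked on the held-out split with `accuracy_score` is shown below. It uses synthetic data from `make_classification` and a deliberately small grid, so its numbers say nothing about the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      {'n_estimators': [10, 30], 'max_depth': [None, 5]},
                      cv=4, scoring='accuracy')
search.fit(X_tr, y_tr)

# best_score_ is the mean cross-validated accuracy on the training data
cv_acc = search.best_score_
# search.predict uses the best estimator refitted on the full training set
test_acc = accuracy_score(y_te, search.predict(X_te))

print("CV accuracy: {:.2f}".format(cv_acc * 100))
print("Test accuracy: {:.2f}".format(test_acc * 100))
```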

### Hyperparameter tuning for Logistic Regression using grid search

In [18]:
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  
    'solver': ['liblinear', 'lbfgs'],  
}
model = LogisticRegression(max_iter=1000)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=4, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
print("Best parameters from Grid Search:", grid_search.best_params_)
print("Best accuracy from Grid Search: {:.2f}".format(grid_search.best_score_ * 100))


Best parameters from Grid Search: {'C': 0.01, 'solver': 'liblinear'}
Best accuracy from Grid Search: 62.50


### Hyperparameter tuning for Logistic Regression using random search

In [19]:
param_dist = {
    'C': uniform(0.01, 100), 
    'solver': ['liblinear', 'lbfgs'],  
}
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=10, cv=4, scoring='accuracy', random_state=42, n_jobs=-1)
random_search.fit(X_train, y_train)
print("Best parameters from Randomized Search:", random_search.best_params_)
print("Best accuracy from Randomized Search: {:.2f}".format(random_search.best_score_ * 100))

Best parameters from Randomized Search: {'C': 37.464011884736244, 'solver': 'liblinear'}
Best accuracy from Randomized Search: 62.50
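
The two random searches draw their candidates from `scipy.stats` distributions, and the argument conventions are easy to misread. The sketch below samples from the same `randint(50, 201)` and `uniform(0.01, 100)` used above. `loguniform` is not used in this notebook and is shown only as a commonly used alternative for scale parameters such as `C`.

```python
from scipy.stats import loguniform, randint, uniform

# randint(low, high) draws integers from low up to high - 1, here 50..200
n_est = randint(50, 201).rvs(size=1000, random_state=0)

# uniform(loc, scale) covers [loc, loc + scale], here [0.01, 100.01]
c_lin = uniform(0.01, 100).rvs(size=1000, random_state=0)

# loguniform(a, b) spreads samples evenly across orders of magnitude
c_log = loguniform(0.01, 100).rvs(size=1000, random_state=0)

print(n_est.min(), n_est.max())
print("share of uniform samples below 1:", (c_lin < 1).mean())
print("share of loguniform samples below 1:", (c_log < 1).mean())
```

With the linear `uniform` distribution, very few samples fall below 1, so small values of `C` are rarely tried.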


### Hyperparameter tuning evaluation for the Logistic Regression model
- Both Grid Search and Randomized Search achieved the same cross-validated accuracy of 62.5%. Randomized Search selected a much larger value of C (37.46, versus 0.01 from Grid Search); since C is the inverse of the regularization strength, this corresponds to much weaker regularization.
- The larger C did not improve the model's performance over the Grid Search result, suggesting that accuracy is not significantly influenced by the choice of these hyperparameters in this case.
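
Because `C` is the inverse of the regularization strength, a smaller `C` shrinks the coefficients more. The sketch below illustrates this on synthetic data from `make_classification`, so it is an illustration of the parameter's meaning and not a rerun of the Titanic experiment. The `norms` dictionary is a name introduced for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: not the Titanic dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

norms = {}
for C in [0.01, 1, 100]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    # L2 norm of the fitted coefficient vector for this value of C
    norms[C] = np.linalg.norm(clf.coef_)

for C, norm in norms.items():
    print("C = {:<6} coefficient norm = {:.3f}".format(C, norm))
```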

### Hyperparameter tuning for Linear Regression using grid search

In [20]:
# LinearRegression has no regularization hyperparameters to tune, so Grid Search adds nothing here.
# For demonstration purposes, I proceed with an empty parameter grid.

linear_model = LinearRegression()
grid_search = GridSearchCV(estimator=linear_model, param_grid={}, cv=4, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)
best_model_grid = grid_search.best_estimator_
y_pred_grid = best_model_grid.predict(X_test)
mse_grid = mean_squared_error(y_test, y_pred_grid)
r2_grid = r2_score(y_test, y_pred_grid)
print("Linear Regression Mean Squared Error (Grid Search): ", mse_grid*100)
print("Linear Regression R^2 Score (Grid Search): ", r2_grid*100)




Linear Regression Mean Squared Error (Grid Search):  2.14589065023884
Linear Regression R^2 Score (Grid Search):  0.0


### Hyperparameter tuning for Linear Regression using random search

In [21]:
# LinearRegression has no regularization hyperparameters to tune, so Random Search adds nothing here.
# For demonstration purposes, I proceed with empty parameter distributions.


random_search = RandomizedSearchCV(estimator=linear_model, param_distributions={}, n_iter=1, cv=4, scoring='neg_mean_squared_error', random_state=42, n_jobs=-1)
random_search.fit(X_train, y_train)
best_model_random = random_search.best_estimator_
y_pred_random = best_model_random.predict(X_test)
mse_random = mean_squared_error(y_test, y_pred_random)
r2_random = r2_score(y_test, y_pred_random)
print("Linear Regression Mean Squared Error (Random Search): ", mse_random*100)
print("Linear Regression R^2 Score (Random Search): ", r2_random*100)

Linear Regression Mean Squared Error (Random Search):  2.14589065023884
Linear Regression R^2 Score (Random Search):  0.0


### Hyperparameter tuning evaluation for the Linear Regression model
- Grid Search and Randomized Search for Linear Regression produced identical results. The printed values, which the code multiplies by 100, are 2.146 for Mean Squared Error (MSE ≈ 0.0215) and 0.0 for R².
- This is expected: with an empty parameter grid, both searches fit the same single LinearRegression model, so neither search method could improve its performance.
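
The identical results follow from how an empty grid is handled: it expands to exactly one candidate, the estimator with its default settings. A minimal sketch on synthetic regression data (not the Titanic dataset) checks this by comparing the search result with a plain `LinearRegression` fit.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Synthetic stand-in data (assumption: not the Titanic dataset)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

# An empty grid contains a single candidate: the empty parameter setting
n_candidates = len(ParameterGrid({}))

search = GridSearchCV(LinearRegression(), param_grid={}, cv=4,
                      scoring='neg_mean_squared_error')
search.fit(X_demo, y_demo)

# The refitted best estimator matches a plain fit on the same data
plain = LinearRegression().fit(X_demo, y_demo)
same = np.allclose(search.best_estimator_.coef_, plain.coef_)

print(n_candidates, search.best_params_, same)
```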