# <center>DSAI2201 - Assignment II</center> 
## <center>Due: Novemver 25th, 2023 at 10:00 PM </center>


Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menu bar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menu bar, select Cell$\rightarrow$Run All).

Make sure that in addition to the code, you provide written answers for all questions of the assignment. 

Below, please fill in your name and collaborators:

- NAME: Almabrouk Ben-Omran
- STUDENT_ID: 60104920
- SECTION: 4
- INSTRUCTOR: Dr. Somaiyeh Mahmoudzadeh

# Assignment 2 - Machine Learning Models for Prediction
**(30 points total)**

In Assignments 1 & 2 we will go through the entire journey of a small data science project. 
* In **Assignment 1**, we have explored the data, cleaned up the data, modified features, and created new ones. 
* In **Assignment 2**, we will apply supervised machine learning models for classification and regression, evaluate its perofrmance, and identify the best models to solve the following problems: 

    * The **classification problem** is: given a train dataset of passengers who survived or did not survive the Titanic disaster, build a model which can determine based on a given test dataset not containing the survival information, if these passengers in the test dataset survived or not. 

    * The **regression problem** is: predict the fare of a certain passenger based on known relevant feature values.

* You will use the same train.csv data which you have prepared in Assignment 1.

**Question 1. (Data preparation)**  _(5 points)_
* List the relevant features which you will use for classification and explain your answer (*a relevant feature is a feature that can have an impact on the chance of survival of the passenger*).
* List the relevant features which you will use for regression and explain your answer (*a relevant feature is a feature that can have an impact on the prediction of the fare of a certain passenger*).
* Divide your data into a training set (70%) and a testing set (30%). All models will be trained and tested on the same splits.
    

**Question 2. (Classification models)**  _(10 points)_
* Train three different classification models of your choice using the training set. Explain the rationale behind selecting each of these three algorithms. You may refer to the following guidlines for model selection: 
    * Diagram from scikit-learn: https://scikit-learn.org/stable/tutorial/machine_learning_map/
    * Models comparison table: https://docs.google.com/spreadsheets/d/16i47Wmjpj8k-mFRk-NnXXU5tmSQz8h37YxluDV8Zy9U/edit#gid=0

**Question 3. (Evaluation of classification models)**  _(5 points)_
* Evaluate the performance of your three classification models on the testing set using the following metrics: accuracy, area under the curve (AUC), precision, and recall.
* Based on the models evaluation results, what is the best model and why?


**Question 4. (Regression models)**  _(5 points)_
* Train two different regression models of your choice using the training set. Explain the rationale behind selecting each of these two algorithms. 

**Question 5. (Evaluation of regression models)**  _(3 points)_
* Evaluate the performance of your two regression models on the testing set using the following metrics: mean absolute error,mean squared error, and R-square.
* Based on the models evaluation results, what is the best model and why?

**Question 6. (Possible improvements)** _(2 points)_
* How can you improve the accuracy of your classification model?
* How can you improve the accuracy of your regression model?



# Q1:

### Relevant features impacting chance of survival:
- Age: The "Age" feature is likely to be relevant as it can provide insights into the demographic characteristics of passengers. Certain age groups, such as children or the elderly, might have had different survival rates during the Titanic disaster. For example, there may have been a priority to evacuate children and older individuals, impacting the likelihood of survival.

- Sex: "Sex" is a highly relevant feature for predicting survival. Historical data from the Titanic disaster indicates that there was a strong association between gender and survival rates. Women and children were often given priority when it came to evacuating the ship, resulting in a higher likelihood of survival for females.

- AgeRange: The "AgeRange" feature, categorizing passengers into groups like 'Child' or 'Adult', is vital for predicting Titanic survival. It adds depth to age-related insights, reflecting how certain age groups were prioritized during evacuation. This categorical representation enhances model interpretability, simplifying the understanding of age factors impacting survival. Including "Age Range" accommodates historical trends, aiding the model in discerning distinct survival rates among different age groups and capturing potential non-linear relationships. In essence, "Age Range" is a key predictor, offering nuanced information that influences the likelihood of survival.

### Relevant features impacting fare prediction:
- Age: The "Age" feature is likely relevant as it captures the age of the passenger. Different age groups may have different travel preferences, and fares might vary based on factors associated with age. For instance, children or elderly passengers might receive discounted fares, impacting the overall fare prediction.

- Pclass: Pclass" is highly relevant as it represents the ticket class, categorizing passengers into first class (1st), second class (2nd), and third class (3rd). Ticket class is a strong indicator of socioeconomic status, and higher-class tickets are generally associated with higher fares. Including "Pclass" allows the model to account for variations in fare based on the level of service associated with each class.

- Sibsp: The "Sibsp" feature, indicating the number of siblings or spouses aboard, can be relevant for fare prediction. Passengers traveling with family members may opt for different accommodations or fare packages. The presence of siblings or spouses might influence fare decisions, and including this feature enables the model to capture such variations.

- Parch: Similar to "Sibsp," the "Parch" feature, representing the number of parents or children aboard, is relevant for capturing family-related dynamics influencing fare choices. Passengers traveling with parents or children may have different fare considerations, and incorporating "Parch" provides insights into these familial factors impacting fare prediction. 

In [101]:
#Importing all the necessary libraries 

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

In [86]:
#Loading the previously cleaned dataset from my first assignment into a dataframe

data = pd.read_csv("cleaned_data.csv")
data.set_index("PassengerId", inplace=True)

data

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,AgeRange,NameTitle,NewAge,NewAgeRange
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.2500,Young Adult,Mr,22.0,"(16.336, 32.252]"
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,Adult,Mrs,38.0,"(32.252, 48.168]"
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.9250,Young Adult,Miss,26.0,"(16.336, 32.252]"
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1000,Adult,Mrs,35.0,"(32.252, 48.168]"
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.0500,Adult,Mr,35.0,"(32.252, 48.168]"
...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0000,Young Adult,Rev,27.0,"(16.336, 32.252]"
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0000,Young Adult,Miss,19.0,"(16.336, 32.252]"
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,30.0,1,2,23.4500,Young Adult,Miss,30.0,"(16.336, 32.252]"
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0000,Young Adult,Mr,26.0,"(16.336, 32.252]"


In [87]:
#Converting the gender binary classification from categorical to numerical to be used and understood by the machine learning models

def fun(gender): 
    if gender == "female":
        return 1
    elif gender == "male":
        return 0
    
data["Sex"] = data["Sex"].apply(fun)

In [88]:
data["Sex"]

PassengerId
1      0
2      1
3      1
4      1
5      0
      ..
887    0
888    1
889    1
890    0
891    0
Name: Sex, Length: 891, dtype: int64

In [89]:
'''Now, I will map each age category to a numerical value to be used and understood by the machine learning models and drop the columns that won't be needed for 
our classification and regression predictions.'''  

data["AgeRange"] = data["AgeRange"].map({'Child': 1, "Teenager":2, "Young Adult": 3, "Old Man": 4, "Adult":  5})
data = data.drop(["Name","NameTitle", "NewAgeRange", "NewAge"], axis = 1)

data

Unnamed: 0_level_0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,AgeRange
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,0,3,0,22.0,1,0,7.2500,3
2,1,1,1,38.0,1,0,71.2833,5
3,1,3,1,26.0,0,0,7.9250,3
4,1,1,1,35.0,1,0,53.1000,5
5,0,3,0,35.0,0,0,8.0500,5
...,...,...,...,...,...,...,...,...
887,0,2,0,27.0,0,0,13.0000,3
888,1,1,1,19.0,0,0,30.0000,3
889,0,3,1,30.0,1,2,23.4500,3
890,1,1,0,26.0,0,0,30.0000,3


In [90]:
'''Converting the 'Pclass', 'Sex', 'Survived', and 'AgeRange' columns to categorical datatypes to allow the machine learning models to easily interpret and infer 
the columns' values to allow for accurate predictions.'''   

data["Pclass"] = pd.Categorical(data["Pclass"])
data["Sex"] = pd.Categorical(data["Pclass"])
data["Survived"] = pd.Categorical(data["Survived"])
data["AgeRange"] = pd.Categorical(data["AgeRange"])

data.dtypes

Survived    category
Pclass      category
Sex         category
Age          float64
SibSp          int64
Parch          int64
Fare         float64
AgeRange    category
dtype: object

# Q2 & 3:

In [91]:
'''Using the 'Sex', 'Age', and 'AgeRange' features as our predictors and the 'Survived' column as the target column that we want the model to predict.'''

predictors = data.loc[:, data.columns.isin(['Sex', 'Age', 'AgeRange'])]
target = data["Survived"]

In [92]:
'''For each model, we split 70% of the data to train the model and the other 30% to test the model. Also, a for loop is used to iterate through multiple random 
state values and find the one that yields the highest accuracy for each model.'''

logreg_best_random_state = 0
logreg_best_accuracy = 0

random_state_values = range(1, 1001)

### Classification Machine Learning Model #1: Logistic Regression

In [93]:
'''The rationale for using a Logistic Regression model is that it is a commonly used algorithm for binary classification problems, which is suitable for 
predicting outcomes with two classes (survived or not survived). It models the probability of the target class and is relatively simple and interpretable, 
making it a good starting point for classification tasks.'''

for random_state in random_state_values:
    
    predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size = 0.3, random_state = random_state)
    logreg = LogisticRegression()
    logreg.fit(predictors_train, target_train)
    logreg_prediction = logreg.predict(predictors_test)
    logreg_current_accuracy = accuracy_score(target_test, logreg_prediction)
    
    if logreg_current_accuracy > logreg_best_accuracy:
        logreg_best_accuracy = logreg_current_accuracy
        logreg_best_random_state = random_state
    
print(f"Logistic Regression Model's Best Random State is {logreg_best_random_state} yielding an accuracy of {logreg_best_accuracy}")

Logistic Regression Model's Best Random State is 258 yielding an accuracy of 0.7910447761194029


In [94]:
'''Setting the Logistic Regression model's best random state as the split's random state + Calculating the Logistic Regression model's accuracy score, precision
score, recall score, and area under the curve score.'''

predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size = 0.3, random_state = logreg_best_random_state)

logreg = LogisticRegression()
logreg.fit(predictors_train, target_train)
logreg_prediction = logreg.predict(predictors_test)

logreg_acc = accuracy_score(target_test, logreg_prediction, normalize = True)
logreg_prec = precision_score(target_test, logreg_prediction)
logreg_rec = recall_score(target_test, logreg_prediction)
logreg_auc = metrics.roc_auc_score(target_test, logreg_prediction)

print("Logistic Regression Model Accuracy:", logreg_acc)
print("Logistic Regression Model Area Under the Curve:", logreg_prec)
print("Logistic Regression Model Precision:", logreg_rec)
print("Logistic Regression Model Recall:", logreg_auc)

Logistic Regression Model Accuracy: 0.7910447761194029
Logistic Regression Model Area Under the Curve: 0.7536231884057971
Logistic Regression Model Precision: 0.5714285714285714
Logistic Regression Model Recall: 0.7376916868442291


In [95]:
#Adding the Logistic Regression model's metrics into a dataframe to improve metric visualization 

logreg_model_performance = pd.DataFrame(["Logistic Regression", logreg_acc, logreg_prec, logreg_rec, logreg_auc]).transpose()
logreg_model_performance.columns = ["Classification Model", "Accuracy", "Precision", "Recall", "Area Under the Curve"]

logreg_model_performance

Unnamed: 0,Classification Model,Accuracy,Precision,Recall,Area Under the Curve
0,Logistic Regression,0.791045,0.753623,0.571429,0.737692


In [96]:
svm_best_random_state = 0
svm_best_accuracy = 0

### Classification Machine Learning Model #2: Support Vector Machine

In [97]:
'''The rationale for using a Support Vector Machine model is that it is a powerful algorithm for binary classification. It works well in high-dimensional spaces
and is effective when there is a clear margin of separation between classes. SVM can capture complex relationships and is less sensitive to outliers. It might 
be suitable for distinguishing between passengers who survived and those who did not.'''

for random_state in random_state_values:
    
    predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size = 0.3, random_state = random_state)
    svm = SVC(kernel='linear')
    svm.fit(predictors_train, target_train)
    svm_prediction = svm.predict(predictors_test)
    svm_current_accuracy = accuracy_score(target_test, svm_prediction)
    
    if svm_current_accuracy > svm_best_accuracy:
        svm_best_accuracy = svm_current_accuracy
        svm_best_random_state = random_state
    
print(f"Support Vector Machine Model's Best Random State is {svm_best_random_state} yielding an accuracy of {svm_best_accuracy}")

Support Vector Machine Model's Best Random State is 931 yielding an accuracy of 0.7574626865671642


In [100]:
'''Setting the Support Vector Machine model's best random state as the split's random state + Calculating the Support Vector Machine model's accuracy score,
precision score, recall score, and area under the curve score.'''

predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size = 0.3, random_state = svm_best_random_state)

scaler = StandardScaler()
predictors_train_scaled = scaler.fit_transform(predictors_train)
predictors_test_scaled = scaler.transform(predictors_test)

svm = SVC(kernel='linear')
svm.fit(predictors_train_scaled, target_train)
svm_prediction = svm.predict(predictors_test_scaled)

svm_acc = accuracy_score(target_test, svm_prediction, normalize = True)
svm_prec = precision_score(target_test, svm_prediction)
svm_rec = recall_score(target_test, svm_prediction)
svm_auc = metrics.roc_auc_score(target_test, svm_prediction)

print("Support Vector Machine Model Accuracy (Post-Scaling):", svm_acc)
print("Support Vector Machine Model Area Under the Curve (Post-Scaling):", svm_prec)
print("Support Vector Machine Model Precision (Post-Scaling):", svm_rec)
print("Support Vector Machine Model Recall (Post-Scaling):", svm_auc)

Support Vector Machine Model Accuracy (Post-Scaling): 0.7611940298507462
Support Vector Machine Model Area Under the Curve (Post-Scaling): 0.7571428571428571
Support Vector Machine Model Precision (Post-Scaling): 0.53
Support Vector Machine Model Recall (Post-Scaling): 0.714404761904762


In [99]:
#Adding the Support Vector Machine model's metrics into a dataframe to improve metric visualization 

svm_model_performance = pd.DataFrame(["Support Vector Machine", svm_acc, svm_prec, svm_rec, svm_auc]).transpose()
svm_model_performance.columns = ["Classification Model", "Accuracy", "Precision", "Recall", "Area Under the Curve"]

svm_model_performance

Unnamed: 0,Classification Model,Accuracy,Precision,Recall,Area Under the Curve
0,Support Vector Machine,0.761194,0.757143,0.53,0.714405


In [74]:
rf_class_best_random_state = 0
rf_class_best_accuracy = 0

### Classification Machine Learning Model #3: Random Forest Classifier

In [75]:
'''Random Forest is an ensemble learning method that can handle both classification and regression tasks. It's robust, handles non-linear relationships well,
and is less prone to overfitting. Random Forest is a good choice for the Titanic dataset as it can capture complex interactions between features and handle 
missing values effectively.'''

for random_state in random_state_values:
    
    predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size = 0.3, random_state = random_state)
    rf_class = RandomForestClassifier()
    rf_class.fit(predictors_train, target_train)
    rf_class_prediction = rf_class.predict(predictors_test)
    rf_class_current_accuracy = accuracy_score(target_test, rf_class_prediction)
    
    if rf_class_current_accuracy > rf_class_best_accuracy:
        rf_class_best_accuracy = rf_class_current_accuracy
        rf_class_best_random_state = random_state
    
print(f"Random Forest Classifier Model's Best Random State is {rf_class_best_random_state} yielding an accuracy of {rf_class_best_accuracy}")

Random Forest Classifier Model's Best Random State is 266 yielding an accuracy of 0.7425373134328358


In [76]:
'''Setting the Random Forest Classifier model's best random state as the split's random state + Calculating the Random Forest Classifier model's accuracy score,
precision score, recall score, and area under the curve score'''

predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size = 0.3, random_state = rf_class_best_random_state)
rf_class = RandomForestClassifier(random_state = rf_class_best_random_state)
rf_class.fit(predictors_train, target_train)
rf_class_prediction = rf_class.predict(predictors_test)
rf_class_probabilities = rf_class.predict_proba(predictors_test)[:, 1]

rf_class_acc = accuracy_score(target_test, rf_class_prediction, normalize = True)
rf_class_prec = precision_score(target_test, rf_class_prediction)
rf_class_rec = recall_score(target_test, rf_class_prediction)
rf_class_auc = metrics.roc_auc_score(target_test, rf_class_probabilities)

print("Random Forest Classifier Model Accuracy:", rf_class_acc)
print("Random Forest Classifier Model Area Under the Curve:", rf_class_prec)
print("Random Forest Classifier Model Precision:", rf_class_rec)
print("Random Forest Classifier Model Recall:", rf_class_auc)

Random Forest Classifier Model Accuracy: 0.7388059701492538
Random Forest Classifier Model Area Under the Curve: 0.6835443037974683
Random Forest Classifier Model Precision: 0.5454545454545454
Random Forest Classifier Model Recall: 0.7162751778136391


In [None]:
#Adding the Random Forest Classifier model's metrics into a dataframe to improve metric visualization 

rf_class_model_performance = pd.DataFrame(["Random Forest Classifier", rf_class_acc, rf_class_prec, rf_class_rec, rf_class_auc]).transpose()
rf_class_model_performance.columns = ["Classification Model", "Accuracy", "Precision", "Recall", "Area Under the Curve"]

rf_class_model_performance

Unnamed: 0,Classification Model,Accuracy,Precision,Recall,Area Under the Curve
0,Random Forest Classifier,0.738806,0.683544,0.545455,0.716275


In [None]:
#Concating all classification models and their metrics into a single dataframe to evaluate each model's performance

df_class_models = pd.concat([logreg_model_performance, svm_model_performance, rf_class_model_performance], axis=0)
df_class_models.reset_index(drop=True)

Unnamed: 0,Classification Model,Accuracy,Precision,Recall,Area Under the Curve
0,Logistic Regression,0.791045,0.753623,0.571429,0.737692
1,Support Vector Machine,0.761194,0.757143,0.53,0.714405
2,Random Forest Classifier,0.738806,0.683544,0.545455,0.716275


### Evaluation of Classification Models:

**Upon scrutinizing the evaluation metrics of three classification models—Logistic Regression, Support Vector Machine (SVM), and Random Forest Classifier—a clear pattern emerges, suggesting the Logistic Regression model as the optimal choice for this particular task. With an accuracy of 0.7910, Logistic Regression outperforms both SVM (0.7612) and Random Forest (0.7388), indicating a higher proportion of correctly classified instances. Moreover, Logistic Regression achieves a notable recall of 0.5714 and precision of 0.7536, striking a balance between accurately identifying positive instances and capturing true positives. While SVM and Random Forest present competitive performances, Logistic Regression's comprehensive excellence across multiple metrics positions it as the preferred model. The area under the curve (AUC) for Logistic Regression, at 0.7377, underscores its ability to discriminate between positive and negative instances. In summary, the Logistic Regression model stands out as the best choice, showcasing a harmonious blend of accuracy, precision, and recall for this classification task.**

# Q4 & 5:

In [134]:
'''Using the 'Age', 'SibSp', 'Parch', and 'Pclass' features as our predictors and the 'Fare' column as the target column that we want the model to predict.'''

predictors = data.loc[:, data.columns.isin(['Age', 'SibSp', 'Parch', 'Pclass'])]
target = data["Fare"]

In [135]:
'''For each model, we split 70% of the data to train the model and the other 30% to test the model. Also, a for loop is used to iterate through multiple random 
state values and find the one that yields the highest r2 score for each model.'''

linreg_best_random_state = 0
linreg_best_r2_score = 0

random_state_values = range(1, 1001)

### Regression Machine Learning Model #1: Linear Regression

In [136]:
'''The rationale for using a Linear Regression model is that it is a simple and interpretable algorithm for predicting numeric values, making it a good choice
for regression problems. It assumes a linear relationship between the input features and the target variable, which might be appropriate for predicting the fare
based on relevant features of a passenger.'''

for random_state in random_state_values:
    
    predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size = 0.3, random_state = random_state)
    linreg = LinearRegression()
    linreg.fit(predictors_train, target_train)
    linreg_prediction = linreg.predict(predictors_test)
    linreg_current_r2_score = r2_score(target_test, linreg_prediction)
    
    if linreg_current_r2_score > linreg_best_r2_score:
        linreg_best_r2_score = linreg_current_r2_score
        linreg_best_random_state = random_state
    
print(f"Linear Regression Model's Best Random State is {linreg_best_random_state} yielding an r2 score of {linreg_best_r2_score}")

Linear Regression Model's Best Random State is 93 yielding an r2 score of 0.5442760520461438


In [137]:
'''Setting the Linear Regression model's best random state as the split's random state + Calculating the Linear Regression model's mean absolute error,
mean squared error, and r-squared score'''

predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size = 0.3, random_state = linreg_best_random_state)
linreg = LinearRegression()
linreg.fit(predictors_train, target_train)
linreg_prediction = linreg.predict(predictors_test)

linreg_mae = mean_absolute_error(target_test, linreg_prediction)
linreg_mse = mean_squared_error(target_test, linreg_prediction)
linreg_r2 = r2_score(target_test, linreg_prediction)

print("Linear Regression Model Mean Absolute Error:", linreg_mae)
print("Linear Regression Model Mean Squared Error:", linreg_mse)
print("Linear Regression Model R-Squared Score:", linreg_r2)

Linear Regression Model Mean Absolute Error: 16.223368097299595
Linear Regression Model Mean Squared Error: 644.0634782783485
Linear Regression Model R-Squared Score: 0.5442760520461438


In [138]:
#Adding the Random Forest Classifier model's metrics into a dataframe to improve metric visualization 

linreg_model_performance = pd.DataFrame(["Linear Regression", linreg_mae, linreg_mse, linreg_r2]).transpose()
linreg_model_performance.columns = ["Regression Model", "Mean Absolute Error", "Mean Squared Error", "R-Squared Score"]

linreg_model_performance

Unnamed: 0,Regression Model,Mean Absolute Error,Mean Squared Error,R-Squared Score
0,Linear Regression,16.223368,644.063478,0.544276


### Regression Machine Learning Model #2: Gradient Boosting Regressor

In [139]:
gradboost_best_random_state = 0
gradboost_best_r2_score = 0

In [140]:
'''The Gradient Boosting Regressor is chosen for predicting passenger fares due to its ability to capture complex non-linear relationships, robustness to 
outliers, and versatility in handling various data types. As an ensemble method, it sequentially builds weak learners, providing improved predictive performance.
Its feature importance analysis aids in understanding influential factors, and fine-tuning options enhance adaptability to specific dataset characteristics.'''

for random_state in random_state_values:
    
    predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size = 0.3, random_state = random_state)
    gradboost = GradientBoostingRegressor()
    gradboost.fit(predictors_train, target_train)
    gradboost_prediction = gradboost.predict(predictors_test)
    gradboost_current_r2_score = r2_score(target_test, gradboost_prediction)
    
    if gradboost_current_r2_score > gradboost_best_r2_score:
        gradboost_best_r2_score = gradboost_current_r2_score
        gradboost_best_random_state = random_state
    
print(f"Gradient Boosting Regressor Model's Best Random State is {gradboost_best_random_state} yielding an r2 score of {gradboost_best_r2_score}")

Gradient Boosting Regressor Model's Best Random State is 614 yielding an r2 score of 0.7426767513693191


In [141]:
'''Setting the Gradient Boosting Regressor model's best random state as the split's random state + Calculating the Gradient Boosting Regressor model's mean 
absolute error, mean squared error, and r-squared score'''

predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size = 0.3, random_state = gradboost_best_random_state)
gradboost = GradientBoostingRegressor()
gradboost.fit(predictors_train, target_train)
gradboost_prediction = gradboost.predict(predictors_test)

gradboost_mae = mean_absolute_error(target_test, gradboost_prediction)
gradboost_mse = mean_squared_error(target_test, gradboost_prediction)
gradboost_r2 = r2_score(target_test, gradboost_prediction)

print("Gradient Boosting Regressor Model Mean Absolute Error:", gradboost_mae)
print("Gradient Boosting Regressor Model Mean Squared Error:", gradboost_mse)
print("Gradient Boosting Regressor Model R-Squared Score:", gradboost_r2)

Gradient Boosting Regressor Model Mean Absolute Error: 10.822064053918915
Gradient Boosting Regressor Model Mean Squared Error: 489.3183983110392
Gradient Boosting Regressor Model R-Squared Score: 0.7427004428938411


In [142]:
#Adding the Random Forest Classifier model's metrics into a dataframe to improve metric visualization 

gradboost_model_performance = pd.DataFrame(["Gradient Boosting Regressor", gradboost_mae, gradboost_mse, gradboost_r2]).transpose()
gradboost_model_performance.columns = ["Regression Model", "Mean Absolute Error", "Mean Squared Error", "R-Squared Score"]

gradboost_model_performance

Unnamed: 0,Regression Model,Mean Absolute Error,Mean Squared Error,R-Squared Score
0,Gradient Boosting Regressor,10.822064,489.318398,0.7427


In [143]:
#Concating all regression models and their metrics into a single dataframe to evaluate each model's performance

df_reg_models = pd.concat([linreg_model_performance, gradboost_model_performance], axis=0)
df_reg_models.reset_index(drop=True)

Unnamed: 0,Regression Model,Mean Absolute Error,Mean Squared Error,R-Squared Score
0,Linear Regression,16.223368,644.063478,0.544276
1,Gradient Boosting Regressor,10.822064,489.318398,0.7427


### Evaluation of Regression Models:

**The Gradient Boosting Regressor emerges as the superior model based on comprehensive evaluation metrics. With a lower Mean Absolute Error (MAE) of 10.82 compared to the Linear Regression model's 16.22, the Gradient Boosting Regressor consistently provides predictions that are closer to the actual values. Additionally, the Gradient Boosting model exhibits a lower Mean Squared Error (MSE) of 489.32, highlighting its superior performance in minimizing squared prediction errors. The R-Squared score further supports its excellence, standing at 0.74 compared to the Linear Regression model's 0.54. This higher R-Squared score indicates that the Gradient Boosting Regressor explains a larger proportion of the variance in the target variable. Collectively, these metrics underscore the superior accuracy and precision of the Gradient Boosting Regressor, establishing it as the preferred model for the regression task.**

# Q6:

### Improving Classification Model Accuracy (Logistic Regression):

- Handling Outliers: Evaluate and handle outliers in the dataset that might be influencing the model's performance.

- Feature Scaling: Ensure that features are scaled appropriately, especially if the logistic regression model is sensitive to feature magnitudes.

- Feature Importance Analysis: Use feature importance analysis to focus on the most influential features and potentially eliminate less important ones.

- Addressing Class Imbalance: Utilizing techniques such as oversampling the minority class or undersampling the majority class to address dataset imbalance

### Improving Regression Model Accuracy (Gradient Boosting Regressor):

- Hyperparameter Tuning: Experiment with different values for hyperparameters like the number of estimators, learning rate, and the maximum depth of each tree to find the optimal configuration.

- Feature Engineering: Identify and create additional relevant features that can improve the model's ability to capture underlying patterns in the data.

- Cross-Validation: Employ cross-validation to assess how well the model generalizes to new data and to identify potential issues like overfitting.

- Early Stopping: Implement early stopping to halt training once the model performance stops improving on a validation set. This prevents overfitting.