<h1> Exercise 1 </h1>

<h3>Importing libraries</h3>

In [91]:
import pandas as pd
import numpy as np
import sklearn as sk
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

<h3>Importing data</h3>

In [92]:
df = pd.read_csv('Datasets/boston_house_prices.csv', header=1)
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


<h3>Splitting the dataset</h3>

In [93]:
Xtrain, Xtest, ytrain, ytest = train_test_split(df.drop('MEDV', axis=1), df['MEDV'], test_size=0.2, random_state=20)

<h3>Standardizing the data</h3>

In [94]:
scaler = StandardScaler()
scaler.fit(Xtrain)
Xtrain = scaler.transform(Xtrain)
Xtest = scaler.transform(Xtest)
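
To illustrate what the cell above does, here is a minimal sketch on synthetic data (an assumption: the arrays and the `demo_` names below are made up and are not the housing dataset). It checks that the scaler learns its mean and standard deviation from the training rows only, then reuses them on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's CSV)
rng = np.random.default_rng(0)
demo_train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
demo_test = rng.normal(loc=50.0, scale=10.0, size=(25, 3))

# Statistics are learned from the training rows only
demo_scaler = StandardScaler().fit(demo_train)
train_scaled = demo_scaler.transform(demo_train)
# The test rows reuse the training mean and standard deviation
test_scaled = demo_scaler.transform(demo_test)

print(train_scaled.mean(axis=0))  # approximately 0 per column
print(train_scaled.std(axis=0))   # approximately 1 per column
print(test_scaled.mean(axis=0))   # near 0, but generally not exactly 0
```

Fitting the scaler on the training split alone keeps information about the test rows out of the preprocessing step.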

<h3>Fit a Support Vector Machine model to the data and test it</h3>

In [95]:
svr = SVR(kernel='linear', C=1.0, epsilon=0.2)
svr.fit(Xtrain, ytrain)
svr.score(Xtest, ytest)

0.7432190393711677

<h3>Select best hyperparameters of the model using GridSearch</h3>

In [96]:
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10], 'epsilon':[0.1, 0.2]}
clf = GridSearchCV(svr, parameters)
clf.fit(Xtrain, ytrain)
clf.score(Xtest, ytest)

0.8420119900766948

In [97]:
clf.best_params_

{'C': 10, 'epsilon': 0.1, 'kernel': 'rbf'}
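
The search above can also be run with the scaler inside a pipeline, so that scaling is re-fitted on each cross-validation fold. The following is a sketch under stated assumptions: the data is synthetic (`make_regression`), and the step names `scale` and `svr` are chosen here for illustration. The grid is the same as in the cell above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: not the housing dataset)
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# Step names 'scale' and 'svr' are arbitrary labels chosen for this sketch
pipe = Pipeline([('scale', StandardScaler()), ('svr', SVR())])
param_grid = {
    'svr__kernel': ['linear', 'rbf'],
    'svr__C': [1, 10],
    'svr__epsilon': [0.1, 0.2],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated R^2 of the best combination
n_candidates = len(search.cv_results_['params'])  # 2 * 2 * 2 combinations
```

`best_score_` is a cross-validated estimate computed on the training data only, so it can differ from the score of the refitted model on a separate test set.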

<h3>Create a function to test the different hyperparameters.</h3>

In [98]:
def test_svr(Xtrain, Xtest, ytrain, ytest, kernel, C, epsilon):
    svr = SVR(kernel=kernel, C=C, epsilon=epsilon)
    svr.fit(Xtrain, ytrain)
    return svr.score(Xtest, ytest)

In [99]:
test_svr(Xtrain, Xtest, ytrain, ytest, 'linear', 1.0, 0.2)

0.7432190393711677
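
A helper like this can be called in a loop to try several combinations by hand. The sketch below assumes synthetic data and a hypothetical helper named `fit_and_score` that mirrors `test_svr`. It scores each combination on a validation split rather than the test set, since picking hyperparameters by test score tends to give an optimistic estimate.

```python
from itertools import product

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR


def fit_and_score(Xtr, Xval, ytr, yval, kernel, C, epsilon):
    """Hypothetical helper mirroring test_svr: fit an SVR, return R^2 on the validation rows."""
    model = SVR(kernel=kernel, C=C, epsilon=epsilon)
    model.fit(Xtr, ytr)
    return model.score(Xval, yval)


# Synthetic stand-in data (assumption: not the housing dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xtr, Xval, ytr, yval = train_test_split(X_demo, y_demo, test_size=0.25, random_state=20)

# Try every combination of the three hyperparameters
scores = {}
for kernel, C, epsilon in product(['linear', 'rbf'], [1, 10], [0.1, 0.2]):
    scores[(kernel, C, epsilon)] = fit_and_score(Xtr, Xval, ytr, yval, kernel, C, epsilon)

best_combo = max(scores, key=scores.get)
print(best_combo, scores[best_combo])
```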

<h3>Fit the data using other algorithms</h3>

In [100]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import SGDRegressor

In [101]:
lr = LinearRegression()
lr.fit(Xtrain, ytrain)
lr.score(Xtest, ytest)

0.7438826183113534

In [102]:
ridge = Ridge(alpha=0.5)
ridge.fit(Xtrain, ytrain)
ridge.score(Xtest, ytest)

0.7438940100967497

In [103]:
lasso = Lasso(alpha=0.1)
lasso.fit(Xtrain, ytrain)
lasso.score(Xtest, ytest)

0.7320306418695779

In [104]:
rfr = RandomForestRegressor(max_depth=2, random_state=0)
rfr.fit(Xtrain, ytrain)
rfr.score(Xtest, ytest)

0.6946860227365617

In [105]:
gbr = GradientBoostingRegressor(random_state=0)
gbr.fit(Xtrain, ytrain)
gbr.score(Xtest, ytest)

0.8223718228825274

In [106]:
sgd = SGDRegressor(max_iter=1000, tol=1e-3)
sgd.fit(Xtrain, ytrain)
sgd.score(Xtest, ytest)

0.7414273286112937

<h3>Compare the performance of the different algorithms. Which is the best Model?</h3>

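Without hyperparameter tuning, the best model on the test set is the GradientBoostingRegressor (R² ≈ 0.822).
After grid search, the tuned SVR (RBF kernel, C=10, epsilon=0.1) scores higher (R² ≈ 0.842).
Only the SVR was tuned, so the comparison is not like-for-like: the other models might also improve with tuning.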
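
A ranking based on one train/test split can shift with the random seed. One way to make the comparison less sensitive to the split is k-fold cross-validation, sketched here on synthetic data (an assumption: `make_regression` stands in for the housing data, and only three of the models are included to keep the example short).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the housing dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

candidates = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(alpha=0.5),
    'GradientBoosting': GradientBoostingRegressor(random_state=0),
}

cv_means = {}
for name, model in candidates.items():
    # Scaling sits inside the pipeline so it is re-fitted on each fold
    pipe = make_pipeline(StandardScaler(), model)
    fold_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)  # default scorer for regressors is R^2
    cv_means[name] = fold_scores.mean()
    print(f'{name}: mean R^2 = {cv_means[name]:.3f} (std {fold_scores.std():.3f})')
```

The spread across folds gives a rough sense of whether a gap between two models is larger than the split-to-split noise.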

<h1> Exercise 2 </h1>

In [107]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC


In [108]:
# Load the dataset
df = pd.read_csv('Datasets/titanic.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.4+ KB


In [109]:
# Drop rows with missing values
df = df.dropna()

In [110]:
# Create a LabelEncoder object
le = LabelEncoder()

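# List of categorical columns to encode
# Note: 'alive' appears to be a text copy of the target 'survived';
# keeping it as a feature leaks the label into the inputs
categorical_cols = ['sex', 'embarked', 'class', 'who', 'deck', 'embark_town', 'alive']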

# Convert boolean columns to int
df['adult_male'] = df['adult_male'].astype(int)
df['alone'] = df['alone'].astype(int)

# Loop over the categorical columns
for col in categorical_cols:
    # No nulls remain after dropna() above; fillna is kept only as a safeguard
    df[col] = df[col].fillna('missing')
    # Transform the column with LabelEncoder
    df[col] = le.fit_transform(df[col])

In [111]:
# Split the dataset into features and target variable
y = df['survived']
X = df.drop('survived', axis=1)


# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=20)

In [112]:
# Define the models
models = [
    ('Logistic Regression', LogisticRegression()),
    ('K-Nearest Neighbors', KNeighborsClassifier()),
    ('Decision Tree', DecisionTreeClassifier()),
    ('Random Forest', RandomForestClassifier()),
    ('Support Vector Machine', SVC())
]

# Train and evaluate each model
for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'{name} Accuracy: {accuracy * 100:.2f}%')

Logistic Regression Accuracy: 100.00%
K-Nearest Neighbors Accuracy: 63.64%
Decision Tree Accuracy: 100.00%
Random Forest Accuracy: 100.00%
Support Vector Machine Accuracy: 69.09%


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
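
Two things in the output above deserve a second look. The 100% accuracies are consistent with target leakage: the `alive` column appears to be a text copy of `survived` and is still among the features. The truncated warning also suggests scaling the data or raising `max_iter` for logistic regression. The sketch below shows one way to handle both. It uses a small synthetic frame (an assumption: it only mimics a few column names, and every value is randomly generated), so the accuracy it reports says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic frame (assumption: mimics the column layout, not the real data)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'age': rng.uniform(1, 80, n),
    'fare': rng.uniform(5, 500, n),
    'survived': rng.integers(0, 2, n),
})
demo['alive'] = np.where(demo['survived'] == 1, 'yes', 'no')  # text copy of the target

# Check whether 'alive' reproduces 'survived' exactly
leaks = (demo['alive'] == 'yes').astype(int).eq(demo['survived']).all()
print('alive duplicates survived:', leaks)

# Drop both the target and its copy from the features
X_demo = demo.drop(columns=['survived', 'alive'])
y_demo = demo['survived']
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=20)

# Scaling inside a pipeline helps the solver converge; max_iter is raised as a safeguard
clf_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf_demo.fit(Xtr, ytr)
acc = clf_demo.score(Xte, yte)
print(f'accuracy on synthetic data: {acc:.2f}')
```

The same pipeline pattern would likely help the distance-based and kernel-based models as well (K-Nearest Neighbors and SVC), since they are sensitive to feature scale.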
