# Section 4 – Regression and Machine Learning
This section is divided into three parts:
1. Multiple Regression Model to Predict Aspect Ratio (4.1)
2. Logistic Regression for Letter vs. Non-Letter Classification (4.2)
3. Categorical Features via Median Splits (4.3)

## Library Installation
The following commands install the libraries required for Section 4.

In [None]:
%pip install pandas
%pip install numpy 
%pip install matplotlib 
%pip install seaborn 
%pip install scipy 
%pip install scikit-learn
%pip install IPython
%pip install statsmodels

## Import Libraries and Load Data

In [None]:
import pandas
import numpy
import matplotlib.pyplot as pyplot
import seaborn
import statsmodels.api as smapi
from scipy import stats
from IPython.display import display
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

data = pandas.read_csv("40394874_features.csv", delimiter=',')
print("Head of Feature Data")
print(data.head())

numeric_cols = ['nr_pix', 'rows_with_1', 'cols_with_1', 'rows_with_3p', 'cols_with_3p', 'aspect_ratio', 'neigh_1', 'no_neigh_above', 'no_neigh_below', 'no_neigh_left', 'no_neigh_right', 'no_neigh_horiz', 'no_neigh_vert', 'connected_areas', 'eyes', 'custom']

data[numeric_cols] = data[numeric_cols].apply(pandas.to_numeric)

## Multiple Regression Model to Predict Aspect Ratio
In this section, the goal was to predict the aspect_ratio of each handwritten symbol from a parsimonious subset of the other 15 features. I selected the predictors by recursive feature elimination with 10-fold cross-validation (RFECV, scored by R^2) and then fitted a multiple regression model on the selected features with Ordinary Least Squares (OLS).

In [None]:
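# Candidate predictors: assumed here to be every numeric feature except the target
predictor_feats = [col for col in numeric_cols if col != 'aspect_ratio']
x = data[predictor_feats]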
y = data['aspect_ratio']

linear_regression = LinearRegression()

selector = RFECV(linear_regression, step=1, cv=10, scoring='r2')
selector.fit(x, y)

selected_features = x.columns[selector.support_]
print("Automatically selected features:", selected_features.tolist())

scores = selector.cv_results_['mean_test_score']

pyplot.figure(figsize=(8, 6))
pyplot.xlabel("Number of features selected")
pyplot.ylabel("CV R^2 Score")
pyplot.plot(range(1, len(scores) + 1), scores)
pyplot.title("RFECV: Feature Selection Performance")
pyplot.show()

x = data[selected_features]
x = smapi.add_constant(x)
model = smapi.OLS(y, x).fit()
print(f"\n{model.summary()}\n")
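
As a hedged illustration of what the OLS fit above estimates, the sketch below recovers known coefficients from synthetic data by solving the least-squares problem directly with numpy. The data, the coefficient values, and the names true_beta and design are assumptions made for this example; none of them come from the feature file.

```python
import numpy

# Synthetic data (assumed for illustration): two predictors, known linear relationship
rng = numpy.random.default_rng(0)
n_samples = 200
predictors = rng.normal(size=(n_samples, 2))
true_beta = numpy.array([1.5, 2.0, -0.5])  # intercept, slope 1, slope 2

# Prepend a column of ones, the same role add_constant plays in the cell above
design = numpy.column_stack([numpy.ones(n_samples), predictors])
noise = rng.normal(scale=0.1, size=n_samples)
response = design @ true_beta + noise

# OLS picks the coefficients that minimise the sum of squared residuals
beta_hat, _, _, _ = numpy.linalg.lstsq(design, response, rcond=None)

# R^2 = 1 - SS_residual / SS_total
residuals = response - design @ beta_hat
ss_total = numpy.sum((response - response.mean()) ** 2)
r_squared = 1 - (residuals @ residuals) / ss_total

print("Estimated coefficients:", numpy.round(beta_hat, 3))
print("R^2:", round(float(r_squared), 4))
```

With little noise, the estimates should land close to true_beta and R^2 close to 1. On the real features the fit will generally be weaker, which is why the summary table is worth reading in full.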

## Logistic Regression for Letter vs. Non-Letter Classification
In this section, I built a logistic regression model to classify images as letters or non-letters using the most discriminative feature identified in Section 3.3, connected_areas.

In [None]:
feature = 'connected_areas'

x = data[[feature]].copy()

# Encode the target: 1 for the letters a-j, 0 for every other symbol
y = data['label'].apply(lambda label: 1 if label in list("abcdefghij") else 0)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

logistic_regression = LogisticRegression()
logistic_regression.fit(x_train, y_train)

y_predicted = logistic_regression.predict(x_test)

print("Accuracy: ", accuracy_score(y_test, y_predicted))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_predicted))
print("Classification Report:\n", classification_report(y_test, y_predicted))

print("\n\n")

pyplot.figure(figsize=(8, 6))
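pyplot.scatter(x[feature], y, c=y, cmap='bwr', alpha=0.5, edgecolors='k')
# Evenly spaced values across the observed range, shaped as a single column
x_values = numpy.linspace(x[feature].min(), x[feature].max(), 300).reshape(-1, 1)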
x_values_df = pandas.DataFrame(x_values, columns=x.columns)
y_probability = logistic_regression.predict_proba(x_values_df)[:, 1]
pyplot.plot(x_values, y_probability, color='black', linewidth=2)
pyplot.title("Logistic Regression Fit Using Connected Areas")
pyplot.xlabel(f"{feature}")
pyplot.ylabel("Probability of being a Letter")
pyplot.tight_layout()
pyplot.show()
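
The curve plotted above is a sigmoid of a linear function of connected_areas. A minimal sketch of that relationship follows. The intercept and coefficient values are made up for illustration and are not the fitted values from this notebook; those would be read from logistic_regression.intercept_ and logistic_regression.coef_.

```python
import math

def letter_probability(connected_areas, intercept, coefficient):
    """Probability of the positive class under a one-feature logistic model."""
    linear_score = intercept + coefficient * connected_areas
    return 1.0 / (1.0 + math.exp(-linear_score))

# Hypothetical parameters, chosen only to illustrate the shape of the curve
intercept = 3.0
coefficient = -2.0

# The predicted class flips where the probability crosses 0.5,
# i.e. where intercept + coefficient * x = 0
decision_boundary = -intercept / coefficient

for value in [1, 2, 3]:
    probability = letter_probability(value, intercept, coefficient)
    predicted = int(probability > 0.5)
    print(f"connected_areas={value}: P(letter)={probability:.3f}, predicted={predicted}")

print("Decision boundary at connected_areas =", decision_boundary)
```

Because the model has a single feature, the classifier reduces to one threshold on connected_areas. The sign of the coefficient decides which side of that threshold is labelled as a letter.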

## Categorical Features via Median Splits
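In this section, I transformed three continuous features (nr_pix, aspect_ratio, and neigh_1) into binary categorical features by applying a median split: a value is coded 1 if it lies above the feature's median and 0 otherwise.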

In [None]:
data['split1'] = (data['nr_pix'] > data['nr_pix'].median()).astype(int)
data['split2'] = (data['aspect_ratio'] > data['aspect_ratio'].median()).astype(int)
data['split3'] = (data['neigh_1'] > data['neigh_1'].median()).astype(int)

def map_label(x):
    if x in list("abcdefghij"):
        return "Letters"
    elif x.lower() in ["smiley", "sad"]:
        return "Faces"
    elif x.lower() in ["xclaim"]:
        return "Exclamation Marks"
    else:
        return x

data['class'] = data['label'].apply(map_label)
data['class'] = pandas.Categorical(data['class'], categories=["Letters", "Faces", "Exclamation Marks"], ordered=True)

prop_table = data.groupby('class', observed=True)[['split1', 'split2', 'split3']].mean().reset_index()
print("Proportion of '1's by Class for Each Split Feature:\n")
display(prop_table)
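
One detail of the median split above is worth a short sketch. Because the comparison is strict (>), values equal to the median are coded 0, so the two groups need not be equal in size when a feature has ties at the median. The toy values below are invented for illustration.

```python
import pandas

# Toy feature with ties at the median (invented values)
toy = pandas.DataFrame({"feature": [1, 2, 2, 2, 3, 4]})

median_value = toy["feature"].median()
# Same rule as the cell above: 1 if strictly above the median, otherwise 0
toy["split"] = (toy["feature"] > median_value).astype(int)

print("Median:", median_value)
print(toy)
print("Proportion coded 1:", toy["split"].mean())
```

Here only two of the six rows are coded 1, since three values sit exactly at the median. For count-like features, calling value_counts() on each split column is one way to see how uneven the groups are before interpreting the proportions.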