# Logistic Regression Training
Logistic regression is a binary classification algorithm, typically used when the target variable has only two possible outcomes.

However, logistic regression can be extended to handle multi-class classification problems through techniques such as *one-vs-all* or *multinomial* logistic regression.

## One-vs-all or One-vs-Rest (OVR)
Train a separate binary logistic regression model for each class.
Each model is trained to distinguish its own class from all other classes.
During prediction, we run each observation through all models, and the class with the highest probability is assigned as the predicted class.
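
The prediction step described above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: it assumes each binary model is stored as a weight vector with the bias in the first position, and the names `sigmoid`, `predict_ovr`, and the toy weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping real values to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_ovr(X, weights, classes):
    """Return, for each row of X, the class whose binary model scores highest.

    X:       (n_samples, n_features) feature matrix.
    weights: (n_classes, n_features + 1) array, one row per binary model,
             bias in column 0 (an assumed layout).
    classes: class labels, in the same order as the rows of `weights`.
    """
    X_bias = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend bias column
    probs = sigmoid(X_bias @ weights.T)                # (n_samples, n_classes)
    return [classes[i] for i in probs.argmax(axis=1)]

classes = ['Ravenclaw', 'Slytherin', 'Gryffindor', 'Hufflepuff']
# Toy weights for illustration only: model k responds to feature k.
weights = np.hstack([np.zeros((4, 1)), np.eye(4)])
X = np.array([[2.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 0.0]])
predictions = predict_ovr(X, weights, classes)
```

Because the sigmoid is monotonic, taking the argmax over probabilities selects the same class as taking the argmax over the raw linear scores.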

Our dataset has a discrete number of possible outcomes: `[Ravenclaw, Slytherin, Gryffindor, Hufflepuff]`.

This method breaks the multi-class problem down into multiple binary classification problems, one per house.

We will be using `k=4` *binary classifiers*.
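
Each of the four binary classifiers needs its own 0/1 target vector. A minimal sketch of how such targets might be built, assuming the house labels are available as an array (the variable names and the five sample labels are illustrative only):

```python
import numpy as np

# Illustrative labels only, not taken from the dataset.
houses = np.array(['Ravenclaw', 'Slytherin', 'Gryffindor', 'Ravenclaw', 'Hufflepuff'])
classes = ['Ravenclaw', 'Slytherin', 'Gryffindor', 'Hufflepuff']

# One binary target vector per class: 1 where the sample belongs to it, else 0.
binary_targets = {c: (houses == c).astype(int) for c in classes}
```

Every sample is a positive example for exactly one classifier and a negative example for the other three.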

## Feature Selection
Based on data visualization:
- `Arithmancy` and `Care of Magical Creatures` do not separate the houses well.
- `Defense Against the Dark Arts` and `Astronomy` are anti-correlated, so they carry redundant information; we can drop one.

All other numerical features will be used for training.
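
A numerical check can complement the plots when looking for redundant features. The sketch below uses synthetic data (the columns `A`, `B`, `C` and the `0.99` threshold are assumptions for illustration, not values from the dataset) and lists pairs whose absolute Pearson correlation is close to 1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'A': a,
    'B': -2.0 * a + 1.0,        # perfectly anti-correlated with A
    'C': rng.normal(size=200),  # independent of A
})

corr = toy.corr()  # pairwise Pearson correlation matrix

# Keep each unordered pair whose absolute correlation is close to 1.
redundant = [(x, y)
             for i, x in enumerate(corr.columns)
             for y in corr.columns[i + 1:]
             if abs(corr.loc[x, y]) > 0.99]
```

For any pair reported this way, one of the two features can be dropped without losing information that a linear model could use.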

## Data preparation
- Only meaningful features will be used
- Remove rows containing `NaN`
- *Standardize* data
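
Standardizing rescales each feature to zero mean and unit standard deviation: `z = (x - mean) / std`. One detail worth keeping in mind when mixing libraries is that pandas' `std()` defaults to the sample standard deviation (`ddof=1`), while NumPy's `std()` defaults to the population standard deviation (`ddof=0`). A small sketch on made-up values:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])  # made-up values for illustration

# pandas: sample standard deviation (ddof=1) by default
z_pandas = (values - values.mean()) / values.std()

arr = values.to_numpy()
# NumPy: population standard deviation (ddof=0) by default
z_numpy_default = (arr - arr.mean()) / arr.std()
# NumPy with ddof=1 reproduces the pandas result
z_numpy_ddof1 = (arr - arr.mean()) / arr.std(ddof=1)
```

Either convention works for training, as long as the same mean and standard deviation are reused when transforming new data at prediction time.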

## TODO
- use numpy to load and write to file
- Data preparation
    - Drop `Arithmancy`, `Care of Magical Creatures` and `Defense Against the Dark Arts`
    - Drop rows containing `NaN`
    - Standardize
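
One way the first TODO item might look, assuming the values to persist are a trained weight matrix. The array contents and the file name `weights.npy` are hypothetical; only `np.save` and `np.load` are actual NumPy calls.

```python
import os
import tempfile
import numpy as np

# Hypothetical parameters: 4 classifiers x (10 features + 1 bias term).
weights = np.zeros((4, 11))

# Hypothetical file location, placed in a temporary directory for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'weights.npy')

np.save(path, weights)    # write the array in NumPy's binary .npy format
restored = np.load(path)  # read it back with the same shape and dtype
```

If a human-readable file is preferred, `np.savetxt` and `np.loadtxt` offer a plain-text alternative for 1-D and 2-D arrays.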

In [52]:
%run "utils.ipynb"

df = get_data()

# Data preprocessing
print('Data frame shape:', df.shape)
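# Features excluded during feature selection
excluded_features = ['Arithmancy', 'Care of Magical Creatures', 'Defense Against the Dark Arts']
# Drop the four non-feature columns at positions 1-4 (those following 'Hogwarts House')
df.drop(df.columns[1:5], inplace=True, axis=1)
df.drop(excluded_features, inplace=True, axis=1)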
df.dropna(inplace=True)
print('Data frame shape after data processing:', df.shape)

# Extract houses
df_houses = df['Hogwarts House']

# Features: every column except the first one ('Hogwarts House')
df_features = df.drop(df.columns[:1], axis=1)
print(df_features.shape)

# Standardize data
df_std_features = df_features.apply(lambda x: (x - x.mean()) / x.std())
df_std_features.head()


Data frame shape: (1600, 18)
Data frame shape after data processing: (1333, 11)
(1333, 10)


| Index | Astronomy | Herbology | Divination | Muggle Studies | Ancient Runes | History of Magic | Transfiguration | Potions | Charms | Flying |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.019405 | 0.86751 | 0.366766 | 1.01501 | 0.341729 | 0.50466 | 0.220663 | -0.70162 | 1.193099 | -0.508231 |
| 1 | -1.142486 | -1.376697 | -2.140728 | -0.547946 | -1.205529 | 0.251192 | 0.657019 | 0.412017 | -1.012445 | -1.395502 |
| 2 | -0.785784 | 1.250242 | 0.710837 | 1.823594 | 1.000191 | 0.126793 | 1.320875 | 0.888527 | 1.813171 | 0.079217 |
| 3 | 1.254526 | -1.474355 | 0.197885 | -0.650158 | 0.261869 | -1.759797 | -2.499039 | -1.657499 | -1.542783 | 1.824033 |
| 4 | 0.754013 | -1.727884 | -0.23645 | -0.459282 | 0.969563 | -1.451893 | -2.110816 | -0.53395 | -1.490523 | 1.386752 |
