To optimize our inventory, we want to know which films will be rented next month, so we are asked to build a model that predicts it.

Possible features that could help predict whether a film gets rented:

+ Info on previous rentals
+ The category of the movie
+ Actors that play in it
+ The rental rate, the allowed rental duration, rating, the length

++ query to get more info on previous rentals ++

"""
CREATE OR REPLACE VIEW rental_info AS
-- One row per film: number of rentals and average rental length in days
SELECT i.film_id,
       COUNT(r.rental_id) AS numb_rentals,
       ROUND(AVG(TIMESTAMPDIFF(DAY, r.rental_date, r.return_date)), 2) AS days_rented
FROM rental AS r
JOIN inventory AS i USING (inventory_id)
GROUP BY i.film_id;
"""

++ query to get more info on movies & actors ++

CREATE OR REPLACE VIEW film_info AS
SELECT f.film_id, f.title, f.rental_duration, f.rental_rate, f.rating, f.length,
       GROUP_CONCAT(CONCAT_WS(' ', a.first_name, a.last_name) SEPARATOR ', ') AS actor_list
FROM film_actor AS fa
    RIGHT JOIN film AS f ON fa.film_id = f.film_id
    RIGHT JOIN actor AS a ON fa.actor_id = a.actor_id
-- Qualify the column: film_id exists in both film_actor and film
GROUP BY f.film_id;


++ query to get more info on categories ++

 CREATE OR REPLACE VIEW category_info AS
 SELECT film_id, name FROM film_category AS fc 
 JOIN category AS c
 USING (category_id);
 
 SELECT * FROM category_info
 ORDER BY film_id;


In [1]:
# Connect to the database & importing the libraries
import pymysql
from sqlalchemy import create_engine
import pandas as pd
import getpass  # To get the password without showing the input

In [2]:
# Connect through SQLAlchemy rather than importing CSVs, since the database is updated constantly
password = getpass.getpass()
connection_string = f'mysql+pymysql://root:{password}@localhost/sakila'
engine = create_engine(connection_string)

········


In [3]:
#have all dfs be organized with distinct film_id in ASC order!
#do this in the sql queries

In [4]:
query_1 = '''
SELECT * FROM sakila.rental_info
'''

In [5]:
data_query1 = pd.read_sql(query_1, engine)
data_query1

Unnamed: 0,film_id,numb_rentals,days_rented
0,1,24,4.55
1,2,7,5.33
2,3,12,2.83
3,4,23,4.36
4,5,12,6.73
...,...,...,...
953,996,7,4.00
954,997,6,4.67
955,998,9,5.25
956,999,17,5.18


In [6]:
query_2 = '''
SELECT * FROM sakila.film_info
'''

In [7]:
data_query2 = pd.read_sql(query_2, engine)
data_query2

Unnamed: 0,film_id,title,rental_duration,rental_rate,rating,length,actor_list
0,1,ACADEMY DINOSAUR,6,0.99,PG,86,"OPRAH KILMER, ROCK DUKAKIS, MARY KEITEL, PENEL..."
1,2,ACE GOLDFINGER,3,4.99,G,48,"BOB FAWCETT, MINNIE ZELLWEGER, SEAN GUINESS, C..."
2,3,ADAPTATION HOLES,7,2.99,NC-17,50,"NICK WAHLBERG, BOB FAWCETT, CAMERON STREEP, RA..."
3,4,AFFAIR PREJUDICE,5,2.99,G,117,"JODIE DEGENERES, SCARLETT DAMON, KENNETH PESCI..."
4,5,AFRICAN EGG,6,2.99,G,130,"MATTHEW CARREY, THORA TEMPLE, GARY PHOENIX, DU..."
...,...,...,...,...,...,...,...
992,996,YOUNG LANGUAGE,6,0.99,G,183,"CHRISTOPHER WEST, MENA HOPPER, ED CHASE, JULIA..."
993,997,YOUTH KICK,4,0.99,NC-17,179,"SANDRA KILMER, VAL BOLGER, SCARLETT BENING, IA..."
994,998,ZHIVAGO CORE,6,0.99,NC-17,105,"KENNETH HOFFMAN, WILLIAM HACKMAN, UMA WOOD, NI..."
995,999,ZOOLANDER FICTION,5,2.99,R,101,"CARMEN HUNT, MARY TANDY, PENELOPE CRONYN, WHOO..."


In [8]:
query_3 = '''
SELECT * FROM sakila.category_info
'''

In [9]:
data_query3 = pd.read_sql(query_3, engine)
data_query3

Unnamed: 0,film_id,name
0,1,Documentary
1,2,Horror
2,3,Documentary
3,4,Horror
4,5,Family
...,...,...
995,996,Documentary
996,997,Music
997,998,Horror
998,999,Children


In [10]:
#concat via columns, axis = 1 

frames = [data_query1, data_query2, data_query3]

df_v1 = pd.concat(frames, axis =1)

In [11]:
df_v1

Unnamed: 0,film_id,numb_rentals,days_rented,film_id.1,title,rental_duration,rental_rate,rating,length,actor_list,film_id.2,name
0,1.0,24.0,4.55,1.0,ACADEMY DINOSAUR,6.0,0.99,PG,86.0,"OPRAH KILMER, ROCK DUKAKIS, MARY KEITEL, PENEL...",1,Documentary
1,2.0,7.0,5.33,2.0,ACE GOLDFINGER,3.0,4.99,G,48.0,"BOB FAWCETT, MINNIE ZELLWEGER, SEAN GUINESS, C...",2,Horror
2,3.0,12.0,2.83,3.0,ADAPTATION HOLES,7.0,2.99,NC-17,50.0,"NICK WAHLBERG, BOB FAWCETT, CAMERON STREEP, RA...",3,Documentary
3,4.0,23.0,4.36,4.0,AFFAIR PREJUDICE,5.0,2.99,G,117.0,"JODIE DEGENERES, SCARLETT DAMON, KENNETH PESCI...",4,Horror
4,5.0,12.0,6.73,5.0,AFRICAN EGG,6.0,2.99,G,130.0,"MATTHEW CARREY, THORA TEMPLE, GARY PHOENIX, DU...",5,Family
...,...,...,...,...,...,...,...,...,...,...,...,...
995,,,,999.0,ZOOLANDER FICTION,5.0,2.99,R,101.0,"CARMEN HUNT, MARY TANDY, PENELOPE CRONYN, WHOO...",996,Documentary
996,,,,1000.0,ZORRO ARK,3.0,4.99,NC-17,50.0,"IAN TANDY, NICK DEGENERES, LISA MONROE",997,Music
997,,,,,,,,,,,998,Horror
998,,,,,,,,,,,999,Children


In [12]:
#different approach where we wouldn't have the problem of three film_id columns, which at some point don't align anymore


df_1 = pd.merge(data_query1,data_query2, how='left',on='film_id')
df = pd.merge(df_1,data_query3, how='left',on='film_id')

In [13]:
df

Unnamed: 0,film_id,numb_rentals,days_rented,title,rental_duration,rental_rate,rating,length,actor_list,name
0,1,24,4.55,ACADEMY DINOSAUR,6.0,0.99,PG,86.0,"OPRAH KILMER, ROCK DUKAKIS, MARY KEITEL, PENEL...",Documentary
1,2,7,5.33,ACE GOLDFINGER,3.0,4.99,G,48.0,"BOB FAWCETT, MINNIE ZELLWEGER, SEAN GUINESS, C...",Horror
2,3,12,2.83,ADAPTATION HOLES,7.0,2.99,NC-17,50.0,"NICK WAHLBERG, BOB FAWCETT, CAMERON STREEP, RA...",Documentary
3,4,23,4.36,AFFAIR PREJUDICE,5.0,2.99,G,117.0,"JODIE DEGENERES, SCARLETT DAMON, KENNETH PESCI...",Horror
4,5,12,6.73,AFRICAN EGG,6.0,2.99,G,130.0,"MATTHEW CARREY, THORA TEMPLE, GARY PHOENIX, DU...",Family
...,...,...,...,...,...,...,...,...,...,...
953,996,7,4.00,YOUNG LANGUAGE,6.0,0.99,G,183.0,"CHRISTOPHER WEST, MENA HOPPER, ED CHASE, JULIA...",Documentary
954,997,6,4.67,YOUTH KICK,4.0,0.99,NC-17,179.0,"SANDRA KILMER, VAL BOLGER, SCARLETT BENING, IA...",Music
955,998,9,5.25,ZHIVAGO CORE,6.0,0.99,NC-17,105.0,"KENNETH HOFFMAN, WILLIAM HACKMAN, UMA WOOD, NI...",Horror
956,999,17,5.18,ZOOLANDER FICTION,5.0,2.99,R,101.0,"CARMEN HUNT, MARY TANDY, PENELOPE CRONYN, WHOO...",Children
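The difference between the two approaches can be reproduced on a toy example. This is only a sketch with made-up frames (`rentals`, `films` and their values are hypothetical, not the Sakila data): `pd.concat(axis=1)` pairs rows by index position, while `pd.merge` pairs them by `film_id`.

```python
import pandas as pd

# Toy frames (hypothetical data): film 2 has no rentals, so `rentals` is one row shorter.
rentals = pd.DataFrame({"film_id": [1, 3], "numb_rentals": [24, 12]})
films = pd.DataFrame({"film_id": [1, 2, 3], "title": ["A", "B", "C"]})

# concat(axis=1) glues rows together by index label, ignoring film_id,
# so the rentals of film 3 end up next to the title of film 2.
by_position = pd.concat([rentals, films], axis=1)

# merge matches rows on the key column, so every row describes a single film.
by_key = pd.merge(rentals, films, how="left", on="film_id")

print(by_position)
print(by_key)
```

As soon as one frame is missing a film, every later row of the concatenated frame is shifted, which is the misalignment seen in `df_v1`.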


In [14]:
df.dtypes

film_id              int64
numb_rentals         int64
days_rented        float64
title               object
rental_duration    float64
rental_rate        float64
rating              object
length             float64
actor_list          object
name                object
dtype: object

In [15]:
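# Why are two film_id columns float and one integer in df_v1?
# concat pads the shorter frames with NaN, and a column that holds NaN is upcast from int64 to float64.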
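A small sketch of the dtype change, using a made-up Series rather than the real `film_id` column: as soon as a missing value appears, pandas can no longer store the column as `int64` and switches to `float64`.

```python
import pandas as pd

ids = pd.Series([1, 2, 3])            # integer column
shifted = ids.reindex([0, 1, 2, 3])   # index 3 has no value, so it becomes NaN

print(ids.dtype)
print(shifted.dtype)
```

The longest frame in the concat never gets padded, which would explain why its `film_id` stays an integer.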

In [17]:
# Drop last two film_id columns
#df = df.loc[:,~df.columns.duplicated()]
#df

In [18]:
df.shape

(958, 10)

In [19]:
#rename name column into category

df.rename(columns = {'name':'category'}, inplace = True) 

In [20]:
df.columns

Index(['film_id', 'numb_rentals', 'days_rented', 'title', 'rental_duration',
       'rental_rate', 'rating', 'length', 'actor_list', 'category'],
      dtype='object')

### Dealing with NaNs

In [21]:
#Check for NaN
#why does film_id return 0 Nan, if I turn it into an object?

df.isnull().sum()

film_id            0
numb_rentals       0
days_rented        0
title              3
rental_duration    3
rental_rate        3
rating             3
length             3
actor_list         3
category           0
dtype: int64

In [22]:
df.isnull().sum().sum()

18

In [23]:
#Other methods to check for NaN, does not show where NaN are

df.isnull().values.any() 

True

In [24]:
df.isna().sum()

film_id            0
numb_rentals       0
days_rented        0
title              3
rental_duration    3
rental_rate        3
rating             3
length             3
actor_list         3
category           0
dtype: int64

In [25]:
# Calculate the share of null values per column (a fraction between 0 and 1)
## reset_index() moves the column names out of the index into a regular column ##

nulls = pd.DataFrame(df.isna().sum()/len(df))
nulls= nulls.reset_index()
nulls.columns = ['column_name', 'Percentage Null Values']
nulls.sort_values(by='Percentage Null Values', ascending = False)

Unnamed: 0,column_name,Percentage Null Values
3,title,0.003132
4,rental_duration,0.003132
5,rental_rate,0.003132
6,rating,0.003132
7,length,0.003132
8,actor_list,0.003132
0,film_id,0.0
1,numb_rentals,0.0
2,days_rented,0.0
9,category,0.0


In [26]:
#drop NaN, is it smart to drop rows here??

df.dropna(axis=0, inplace=True )

In [27]:
df.isnull().sum()

film_id            0
numb_rentals       0
days_rented        0
title              0
rental_duration    0
rental_rate        0
rating             0
length             0
actor_list         0
category           0
dtype: int64

In [28]:
df.dtypes

film_id              int64
numb_rentals         int64
days_rented        float64
title               object
rental_duration    float64
rental_rate        float64
rating              object
length             float64
actor_list          object
category            object
dtype: object

### Split into numerical and categorical data


In [30]:
import numpy as np
cat = df.select_dtypes(include = 'object')  # np.object was removed from recent NumPy versions

In [32]:
num = df.select_dtypes(include = np.number)

### Cleaning categorical data

In [42]:
## Check each categorical column for stray whitespace and decide whether any can be dropped

df['title'].value_counts().to_frame()

Unnamed: 0,title
SLEEPY JAPANESE,1
IDAHO LOVE,1
SEA VIRGIN,1
CAMELOT VACATION,1
SUNSET RACER,1
...,...
MYSTIC TRUMAN,1
RESERVOIR ADAPTATION,1
TWISTED PIRATES,1
ALABAMA DEVIL,1


In [43]:
df['title'].unique()

array(['ACADEMY DINOSAUR', 'ADAPTATION HOLES', 'AGENT TRUMAN',
       'AIRPLANE SIERRA', 'ALABAMA DEVIL', 'ALADDIN CALENDAR',
       'ALASKA PHANTOM', 'ALI FOREVER', 'ALIEN CENTER', 'ALLEY EVOLUTION',
       'ALTER VICTORY', 'AMADEUS HOLY', 'ANONYMOUS HUMAN', 'ANTHEM LUKE',
       'ANTITRUST TOMATOES', 'APACHE DIVINE', 'ARABIA DOGMA',
       'ARACHNOPHOBIA ROLLERCOASTER', 'ARIZONA BANG',
       'ARTIST COLDBLOODED', 'ATTACKS HATE', 'ATTRACTION NEWTON',
       'BABY HALL', 'BACKLASH UNDEFEATED', 'BALLOON HOMEWARD',
       'BANG KWAI', 'BASIC EASY', 'BED HIGHBALL', 'BEDAZZLED MARRIED',
       'BEETHOVEN EXORCIST', 'BEHAVIOR RUNAWAY', 'BENEATH RUSH',
       'BERETS AGENT', 'BETRAYED REAR', 'BIKINI BORROWERS',
       'BILKO ANONYMOUS', 'BILL OTHERS', 'BINGO TALENTED',
       'BIRCH ANTITRUST', 'BIRDCAGE CASPER', 'BLACKOUT PRIVATE',
       'BLADE POLISH', 'BLINDNESS GUN', 'BOILED DARES', 'BORN SPINAL',
       'BOUND CHEAPER', 'BOWFINGER GABLES', 'BRANNIGAN SUNRISE',
       'BRAVEHEART HUMAN

In [44]:
# Similar to the actors: too many distinct values, so encoding them would make the df huge.
# I could check for certain keywords, extract them, and bin the titles by how sensational the keywords are.
# For now I will just drop them, I think.

In [34]:
df['title'].value_counts().index
# .index returns the labels of the value_counts() Series, i.e. the unique titles

Index(['KARATE MOON', 'ISLAND EXORCIST', 'SEATTLE EXPECATIONS', 'BANG KWAI',
       'UNFORGIVEN ZOOLANDER', 'ROBBERY BRIGHT', 'ENDING CROWDS',
       'FEATHERS METAL', 'JAWS HARRY', 'TERMINATOR CLUB',
       ...
       'DATE SPEED', 'SEA VIRGIN', 'FEUD FROGMEN', 'SPLENDOR PATTON',
       'DANCES NONE', 'AFRICAN EGG', 'JAWBREAKER BROOKLYN', 'HALLOWEEN NUTS',
       'SILENCE KANE', 'CASUALTIES ENCINO'],
      dtype='object', length=955)

In [35]:
## rating is the MPAA film rating (G, PG, PG-13, R, NC-17) ##

df['rating'].value_counts().to_frame()

Unnamed: 0,rating
PG-13,213
NC-17,202
R,187
PG,183
G,170


In [38]:
#maybe drop some of the ratings, such as Sian did in her example? how would i decide this?

df = df[~df['rating'].isin(['R', 'G'])]

df['rating'].value_counts().to_frame()

Unnamed: 0,rating
PG-13,213
NC-17,202
PG,183


In [39]:
df['actor_list'].value_counts().to_frame()

Unnamed: 0,actor_list
"MEG HAWKE, CUBA ALLEN, PENELOPE MONROE, KEVIN GARLAND, KIM ALLEN",1
"GINA DEGENERES, LIZA BERGMAN, IAN TANDY",1
"CHRISTOPHER WEST, GENE MCKELLEN, MATTHEW CARREY, REESE WEST, CHRISTIAN GABLE, UMA WOOD, MINNIE ZELLWEGER, DARYL WAHLBERG",1
"MILLA PECK, ED MANSFIELD, LUCILLE DEE, JOHN SUVARI",1
"VIVIEN BERGEN, HELEN VOIGHT, AL GARLAND",1
...,...
"KENNETH PALTROW, KEVIN GARLAND",1
"ELVIS MARX, MILLA PECK, KENNETH TORN, SCARLETT BENING, FAY WOOD, GREGORY GOODING",1
"OPRAH KILMER, DAN TORN, BEN WILLIS",1
"BELA WALKEN, PENELOPE GUINESS, JENNIFER DAVIS, KARL BERRY, CUBA OLIVIER, RIP CRAWFORD, CHRISTIAN AKROYD, CARY MCCONAUGHEY, RICHARD PENN",1


In [None]:
# I will leave this out for now until I know how to extract useful features here. I would probably need another column with a
# fame rating, but then I would need a metric for calculating the fame of each actor.
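One possible starting point, sketched on made-up rows: count the actors per film and use the number of films an actor appears in as a crude stand-in for fame. Both derived columns (`n_actors`, `max_actor_films`) and the fame proxy itself are assumptions for illustration, not something the Sakila schema provides.

```python
import pandas as pd

# Hypothetical rows in the shape of film_info.actor_list
films = pd.DataFrame({
    "film_id": [1, 2, 3],
    "actor_list": ["BOB FAWCETT, NICK WAHLBERG", "UMA WOOD", "BOB FAWCETT, MARY KEITEL"],
})

# Split the comma-separated string into a list of names per film
actors = films["actor_list"].str.split(", ")
films["n_actors"] = actors.str.len()

# Assumed fame proxy: the number of films each actor appears in
appearances = actors.explode().value_counts()
films["max_actor_films"] = actors.apply(
    lambda names: max(appearances[name] for name in names)
)

print(films)
```

This keeps the feature count at two numeric columns instead of one dummy column per actor.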

In [41]:
# in case needed: replace value with map, filter value with filter, etc. check flo's workshop

###  replacing a value (From Sians example)
### data['k_symbol'] = list(map(cleankSymbol, data['k_symbol'])) 

In [40]:
df['category'].value_counts().to_frame()

Unnamed: 0,category
Sports,46
Drama,43
Animation,43
Foreign,42
Family,42
New,40
Children,39
Music,38
Documentary,38
Comedy,37


### Cleaning numerical data

In [None]:
#correlation matrix (numerical columns only) ---> very little multicollinearity

df.corr(numeric_only=True)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Checking for multicollinearity (rule of thumb: |correlation| of about 0.8 or more)

corr_matrix=df.corr(method='pearson', numeric_only=True)  # pearson is the default
fig, ax = plt.subplots(figsize=(10, 8))
ax = sns.heatmap(corr_matrix, annot=True)
plt.show()

In [None]:
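# Why are some correlations negative? A negative coefficient only means one variable falls as the other rises;
# for multicollinearity the absolute value is what matters.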

## Analyze extracted features and transform them. You may need to encode some categorical variables, or scale numerical variables.

In [None]:
# How would I extract features??? still need to learn this

#Before encoding, i should bin the categorical variables first!
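A minimal sketch of binning before encoding, on a made-up category column: levels that occur fewer than a chosen number of times are collapsed into "Other" before `pd.get_dummies` is applied. The threshold of 2 and the label "Other" are arbitrary choices for the example.

```python
import pandas as pd

# Hypothetical category column
category = pd.Series(
    ["Sports", "Sports", "Drama", "Drama", "Music", "New"], name="category"
)

counts = category.value_counts()
rare = counts[counts < 2].index  # levels seen fewer than 2 times (arbitrary threshold)

# Keep frequent levels, replace rare ones with a single "Other" bucket
binned = category.where(~category.isin(rare), "Other")

# One column per remaining level
encoded = pd.get_dummies(binned, prefix="category")

print(binned.tolist())
print(encoded.columns.tolist())
```

Four original levels become three dummy columns here; on a real column with many rare levels the saving is larger.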


#### Visualize data

In [None]:
#check histogram of numerical data

In [None]:
#just for fun

sns.distplot(df['film_id'])
plt.show()

In [None]:
sns.distplot(df['numb_rentals'])
plt.show()

In [None]:
sns.distplot(df['days_rented'])
plt.show()

In [None]:
#need to reformat the x-axis 

sns.distplot(df['rental_duration'])
plt.show()


In [None]:
# https://seaborn.pydata.org/generated/seaborn.distplot.html
#https://www.geeksforgeeks.org/formatting-axes-in-python-matplotlib/

In [None]:
sns.distplot(df['rental_rate'])
plt.show()

In [None]:
# with reformated x-axis 

sns.distplot(df['rental_rate'],bins = 3, kde = False)
plt.show()

In [None]:
sns.distplot(df['length'])
plt.show()

In [None]:
# All of them look fairly normally distributed, so why still scale?
# Scaling is about putting the features on comparable ranges, not about the shape of the distribution.

### Data Pre-Processing

In [None]:
from sklearn.preprocessing import Normalizer
# from sklearn.preprocessing import StandardScaler

In [None]:
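#split off the numerical data  #normalize them or standardize?

import numpy as np 
X = df.select_dtypes(include = np.number)
# Normalizing data. Note: Normalizer rescales each ROW to unit length, not each column.
transformer = Normalizer().fit(X)
x_normalized = transformer.transform(X)
# transform() returns a plain NumPy array, so restore the column names and the index
x = pd.DataFrame(x_normalized, columns=X.columns, index=X.index)

In [None]:
# The column names disappear because transform() returns a NumPy array; they are restored above.

x

In [None]:
# select rental_rate by name now that the column names are kept

sns.distplot(x['rental_rate'])
plt.show()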

In [None]:
# when to use Normalizer , when to use StandardScaler? or any other method?
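One way to see the difference is to run the three scikit-learn scalers on the same tiny array. This is a sketch with made-up values standing in for rental_rate and length; it only shows what each class does, not which one is right for this dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Two features on very different scales (hypothetical values): rental_rate, length
X_demo = np.array([[0.99, 86.0], [4.99, 48.0], [2.99, 130.0]])

row_unit = Normalizer().fit_transform(X_demo)        # each ROW rescaled to Euclidean length 1
col_minmax = MinMaxScaler().fit_transform(X_demo)    # each COLUMN mapped onto [0, 1]
col_zscore = StandardScaler().fit_transform(X_demo)  # each COLUMN to mean 0, std 1

print(np.linalg.norm(row_unit, axis=1))
print(col_minmax.min(axis=0), col_minmax.max(axis=0))
print(col_zscore.mean(axis=0).round(6), col_zscore.std(axis=0).round(6))
```

Normalizer works across the features of a single film, so the large length value dominates each row; the other two work per feature, which is usually what is meant by scaling inputs for a model like logistic regression.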

### Encode categorical 

In [None]:
df.dtypes

In [None]:
### need to go over this whole section, want to bin categorical data first

In [None]:
# get_dummies returns one column per level, so the result cannot be assigned back to a single column

df = pd.get_dummies(df, columns=['rating'])

In [None]:
df = pd.get_dummies(df, columns=['category'])

In [None]:
# title and actor_list are unique per film, so one-hot encoding them gives a very big df.
# Drop them for now; before including them, double-check for a correlation with the predicted value and bin them.

df = df.drop(columns=['title', 'actor_list'])

In [None]:
#cat = df.select_dtypes(include = np.object)

#categorical = pd.get_dummies(cat, columns=['title', 'rating', 'category'])

#when I tried this approach the categorical data was not turned into uint8




In [None]:
#categorical.head()

In [None]:
#categorical.dtypes

In [None]:
# This gives a very big df; before including it,
# double-check whether there is actually a correlation with the predicted value


#categorical = df[['title', 'rating', 'category']]

df.corr(numeric_only=True)['numb_rentals'].to_frame()



In [None]:
#concatenate both, why use np instead of pd... x, categorical are both dataframes, why are we working with arrays now?

#X = np.concatenate((x, categorical), axis=1)

X = df

In [None]:
# Why does df suddenly have a lot fewer rows?
# dropna and the rating filter removed rows but kept the original index labels, so the index no longer matches the row count.

X

## Create a query to get the list of films and a boolean indicating if it was rented last month. This would be our target variable.

define the y value

In [None]:
target_query = '''
select film_id,
       case
           when times_rented_last_month >= 1 then 1
           else 0
       end as rented
from (select film_id,
             sum(case
                     when rental_date >= '2005-07-01' and rental_date < '2005-08-01' then 1
                     else 0
                 end) as times_rented_last_month
      -- derived table: one row per film with its rental count for the month
      from film left join inventory using (film_id) left join rental using (inventory_id)
      group by 1) as cte;
'''

In [None]:
target = pd.read_sql(target_query, engine)

In [None]:
target
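The logic of the target can be checked on a toy example in pandas. This is a sketch with made-up rental rows (all values are hypothetical): a film gets `rented = 1` if it has at least one rental inside the month, and films with no rentals at all still appear with `rented = 0`.

```python
import pandas as pd

# Hypothetical rental rows, already joined to film_id
rentals = pd.DataFrame({
    "film_id": [1, 1, 2, 3],
    "rental_date": pd.to_datetime([
        "2005-07-05 10:00", "2005-08-01 00:00", "2005-06-30 22:00", "2005-07-31 23:00",
    ]),
})
all_films = pd.Series([1, 2, 3, 4], name="film_id")  # film 4 was never rented

# Half-open interval: from July 1st (inclusive) up to August 1st (exclusive)
start, end = pd.Timestamp("2005-07-01"), pd.Timestamp("2005-08-01")
in_month = (rentals["rental_date"] >= start) & (rentals["rental_date"] < end)
rented_ids = rentals.loc[in_month, "film_id"].unique()

target_toy = pd.DataFrame({
    "film_id": all_films,
    "rented": all_films.isin(rented_ids).astype(int),
})
print(target_toy)
```

Using `>= start` and `< end` avoids counting a rental made at midnight on August 1st, which an inclusive range would let through.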

In [None]:
# y and X need to have the same number of rows for the train-test split
# what happens when y is a lot smaller?
# use RandomOverSampler?
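A sketch of one way to get matching rows, assuming both frames carry a `film_id` column (the toy frames `features` and `labels` are made up): merge the labels onto the features by key instead of cutting the target to the same length, so each film keeps its own label.

```python
import pandas as pd

# Hypothetical frames: some films were filtered out of the features
features = pd.DataFrame({"film_id": [1, 3, 5], "length": [86, 50, 130]})
labels = pd.DataFrame({"film_id": [1, 2, 3, 4, 5], "rented": [1, 0, 1, 1, 0]})

# Inner merge keeps only films present in both frames and pairs each row with its label
model_df = features.merge(labels, on="film_id", how="inner")

X_toy = model_df.drop(columns=["film_id", "rented"])
y_toy = model_df["rented"]

print(len(X_toy), len(y_toy))
print(y_toy.tolist())
```

Slicing the first three labels instead would have returned `[1, 0, 1]`, which belongs to films 1, 2 and 3 rather than 1, 3 and 5.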

In [None]:
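## set film_id as the index so each label can be looked up by film ##

target = target.set_index("film_id")
# pick the label of every film that is still in X instead of slicing the first 598 rows
y = target['rented'].reindex(X['film_id'])
target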

In [None]:
target.dtypes

In [None]:
#change rented column into a boolean dtype??

#target['rented'] = target['rented'].astype(bool) 

#not necessary since consist of 1 and 0s 

Split train & test data

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
X_train, X_test,y_train, y_test = train_test_split(X,y, test_size = 0.25, random_state=42)
#scaled_x = StandardScaler().fit_transform(X)


In [None]:
from sklearn.preprocessing import StandardScaler
sc= StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
X_test

In [None]:
y_train.value_counts()

### Feature Scaling

In [46]:
#from andres

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

NameError: name 'X_train' is not defined

In [None]:
# binary target, so no multi_class setting is needed (the parameter is deprecated in recent scikit-learn)
classifier = LogisticRegression(random_state=0, solver='lbfgs')

#classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

In [None]:
X_test

In [None]:
y_pred = classifier.predict(X_test)
y_pred

## Confusion Matrix and Accuracy Score

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test,y_pred)
print(cm)
m = accuracy_score(y_test, y_pred)
print('Our model has',round(m*100,2),' % of accuracy')

## Classification report

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
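The numbers in the classification report can be derived by hand from the confusion matrix. A sketch with made-up counts (not results from this model), using scikit-learn's layout where rows are actual classes and columns are predicted classes:

```python
# Hypothetical 2x2 confusion matrix: [[tn, fp], [fn, tp]]
tn, fp = 50, 10
fn, tp = 5, 35

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the films predicted as rented, how many really were
recall = tp / (tp + fn)     # of the films really rented, how many the model found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

These are the values reported in the row for class 1; the row for class 0 uses the same formulas with the roles of the two classes swapped.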

In [None]:
## ROC curve and AUC (area under the curve) for the classifier

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
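# use predicted probabilities rather than hard 0/1 labels, so the AUC matches the plotted curve
logit_roc_auc = roc_auc_score(y_test, classifier.predict_proba(X_test)[:,1])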
fpr, tpr, thresholds = roc_curve(y_test, classifier.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('LogReg')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
import numpy as np

In [None]:
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')


# Scratch: try StandardScaler instead of Normalizer on the numerical features.
# Only the features are scaled; the 0/1 target stays untouched, otherwise it stops being a class label.

from sklearn.preprocessing import StandardScaler

scaled_x = StandardScaler().fit_transform(x)

In [None]:
# the fitted model is called classifier
predictions = classifier.predict(X_test)
classifier.score(X_test, y_test)

In [None]:
#lets bring in the confusion matrix

from sklearn.metrics import confusion_matrix
cf_matrix = confusion_matrix(y_test, predictions)
print(cf_matrix)

In [None]:
sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True, 
            fmt='.2%', cmap='Blues')

In [None]:
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')

In [None]:
#ROC & AUC analysis... no idea again, how to do that nor what this really does... 
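One way to read the AUC: it is the share of (rented, not rented) pairs in which the model gives the rented film the higher score. A sketch in plain Python with made-up labels and scores, assuming the scores are predicted probabilities for class 1:

```python
from itertools import product

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical predicted probabilities for class 1

pos = [s for s, y in zip(scores, y_true) if y == 1]
neg = [s for s, y in zip(scores, y_true) if y == 0]

# Count the pairs where the positive outranks the negative; ties count half
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(pos, neg)
)
auc = wins / (len(pos) * len(neg))

print(auc)  # 0.75
```

An AUC of 0.5 corresponds to random ranking (the red diagonal in the ROC plot) and 1.0 to a perfect ranking; the ROC curve itself shows the true positive rate against the false positive rate as the decision threshold moves.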

Many machine learning algorithms perform better when numerical input variables are scaled to a standard range.

This includes algorithms that use a weighted sum of the input, like linear regression, and algorithms that use distance measures, like k-nearest neighbors.

 Normalization scales each input variable separately to the range 0-1, which is the range for floating-point values where we have the most precision. 
 
 Standardization scales each input variable separately by subtracting the mean (called centering) and dividing by the standard deviation to shift the distribution to have a mean of zero and a standard deviation of one.

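### Normalization
Normalization is a rescaling of the data from the original range so that all values fall between 0 and 1. In scikit-learn this is what MinMaxScaler does; Normalizer, despite its name, rescales each row (sample) to unit norm instead.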

Normalization requires that you know or are able to accurately estimate the minimum and maximum observable values. You may be able to estimate these values from your available data.

### Standardizing

Standardizing a dataset involves rescaling the distribution of values so that the mean of observed values is 0 and the standard deviation is 1.

This can be thought of as subtracting the mean value (centering the data) and then dividing by the standard deviation.
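Written as formulas for a single feature $x$, the two rescalings described above are as follows. The symbols $x_{\min}$, $x_{\max}$, $\mu$ and $\sigma$ are notation introduced here for the minimum, maximum, mean and standard deviation of that feature, estimated from the training data.

```latex
% Normalization (min-max scaling): maps the feature onto [0, 1]
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}

% Standardization (z-score): mean 0 and standard deviation 1
z = \frac{x - \mu}{\sigma}
```

For example, a film length of 86 minutes in a column ranging from 46 to 185 normalizes to $(86 - 46) / (185 - 46) \approx 0.29$; the range used here is illustrative only.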



More info on different standardization and normalization methods:
    https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html

In [None]:
### I understand what each does and how we would transform numerical data with it, but it is still unclear to me
# when to use which approach.
# Why did Andres choose standardizing over normalization here?
# Why use StandardScaler, and what does fit_transform do?
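On the last question, a small sketch with made-up numbers: `fit_transform` is `fit` (learn the mean and standard deviation) followed by `transform` (apply them) in one call. The test set is only transformed, so it is scaled with the statistics learned from the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature train and test sets
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_a = scaler.fit_transform(train)                   # learn mean/std from train, then scale it
train_b = StandardScaler().fit(train).transform(train)  # the same thing in two steps

# Reuse the TRAIN mean/std on the test data; the scaler is not refit here
test_scaled = scaler.transform(test)

print(scaler.mean_, scaler.scale_)
print(test_scaled)
```

Fitting the scaler on the test data as well would leak information about the test set into preprocessing, which is why the earlier cells call `fit_transform` on `X_train` and only `transform` on `X_test`.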
