<a href="https://colab.research.google.com/github/SFauth/learning_scikit_learn/blob/main/scikit_learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Learn scikit-learn (Machine Learning in Python)

Installing the packages

In [None]:
!pip install scikit-learn
!pip install tensorflow
!pip install plotly

Loading the packages

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
from google.colab import drive 
from sklearn.pipeline import Pipeline   
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PoissonRegressor
from sklearn.ensemble import HistGradientBoostingRegressor

Setting the working directory

In [None]:
drive.mount("/content/gdrive", force_remount=True)
%cd /content/gdrive/MyDrive/Learning_ML_and_AI/

Mounted at /content/gdrive
/content/gdrive/MyDrive/Learning_ML_and_AI


## Data Wrangling with pandas and NumPy

Reading the data with pandas

In [None]:
df = pd.read_csv("influencer_data/top_insta_influencers_data.csv")
df.head(5)

Unnamed: 0,rank,channel_info,influence_score,posts,followers,avg_likes,60_day_eng_rate,new_post_avg_like,total_likes,country
0,1,cristiano,92,3.3k,475.8m,8.7m,1.39%,6.5m,29.0b,Spain
1,2,kyliejenner,91,6.9k,366.2m,8.3m,1.62%,5.9m,57.4b,United States
2,3,leomessi,90,0.89k,357.3m,6.8m,1.24%,4.4m,6.0b,
3,4,selenagomez,93,1.8k,342.7m,6.2m,0.97%,3.3m,11.5b,United States
4,5,therock,91,6.8k,334.1m,1.9m,0.20%,665.3k,12.5b,United States


Which data types do we have? Do we need to change anything?

In [None]:
df.dtypes

rank                  int64
channel_info         object
influence_score       int64
posts                object
followers            object
avg_likes            object
60_day_eng_rate      object
new_post_avg_like    object
total_likes          object
country              object
dtype: object

The classic problem: many columns are stored as strings (`object`) even though they hold numbers such as `3.3k` or `475.8m`. We should therefore convert these columns (variables) to numeric types.

In [None]:
kmb_columns = ['avg_likes', 'posts', 'followers', 'new_post_avg_like', 'total_likes']

# Turn each suffix into a multiplication ('3.3k' -> '3.3*1e3'), evaluate it, and cast to integers.
# The '*' matters: without it, '29.0b' would become the literal '29.01e9'.
# On pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap.
kmb_df = df[kmb_columns].replace({'k': '*1e3', 'm': '*1e6', 'b': '*1e9'}, regex=True).applymap(pd.eval).astype('int64')

kmb_df.head(5)

# Keep the remaining columns and append the converted ones.
other_columns = df.drop(kmb_columns, axis=1)

df = pd.concat([other_columns, kmb_df], axis=1)

df.head(5)

Unnamed: 0,rank,channel_info,influence_score,60_day_eng_rate,country,avg_likes,posts,followers,new_post_avg_like,total_likes
0,1,cristiano,92,1.39%,Spain,8700000,3300,475800000,6500000,29000000000
1,2,kyliejenner,91,1.62%,United States,8300000,6900,366200000,5900000,57400000000
2,3,leomessi,90,1.24%,,6800000,890,357300000,4400000,6000000000
3,4,selenagomez,93,0.97%,United States,6200000,1800,342700000,3300000,11500000000
4,5,therock,91,0.20%,United States,1900000,6800,334100000,665300,12500000000
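
The suffix conversion can also be done without building expressions for `pd.eval`. The following is a minimal sketch under that assumption; `parse_kmb` and `SUFFIX_MULTIPLIERS` are hypothetical names that do not appear in the notebook, and the function only handles the `k`, `m` and `b` suffixes seen in the data.

```python
from decimal import Decimal

# Multipliers for the suffixes seen in the data (hypothetical helper table).
SUFFIX_MULTIPLIERS = {"k": 10**3, "m": 10**6, "b": 10**9}


def parse_kmb(text: str) -> int:
    """Convert strings such as '3.3k' or '29.0b' to an integer."""
    cleaned = text.strip().lower()
    suffix = cleaned[-1:]
    if suffix in SUFFIX_MULTIPLIERS:
        number, multiplier = cleaned[:-1], SUFFIX_MULTIPLIERS[suffix]
    else:
        number, multiplier = cleaned, 1
    # Decimal avoids binary floating-point rounding (e.g. 3.3 * 1000).
    return int(Decimal(number) * multiplier)


print(parse_kmb("3.3k"))   # 3300
print(parse_kmb("29.0b"))  # 29000000000
```

On the real data, such a helper could be applied column by column, for example with `df[col].map(parse_kmb)` for each `col` in `kmb_columns`.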


The country is not numeric yet either. Which values occur? Can we simply dummy-code it?

In [None]:
df.country.value_counts().head(5)

United States    66
Brazil           13
India            12
Indonesia         7
France            6
Name: country, dtype: int64

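One dummy per country would add too many variables, so we simply encode whether the country is the United States or not. Note that rows with a missing country (such as leomessi) also end up in `Not_US`.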

In [None]:
df['country'] = np.where(df['country'] == "United States", "US", "Not_US")

Only the 60-day engagement rate is left!

In [None]:
df['60_day_eng_rate'] = df['60_day_eng_rate'].str.rstrip('%').astype('float') / 100.0

Our DataFrame is now ready to be processed with scikit-learn!

In [None]:
df.head(5)

Unnamed: 0,rank,channel_info,influence_score,60_day_eng_rate,country,avg_likes,posts,followers,new_post_avg_like,total_likes
0,1,cristiano,92,0.0139,Not_US,8700000,3300,475800000,6500000,29000000000
1,2,kyliejenner,91,0.0162,US,8300000,6900,366200000,5900000,57400000000
2,3,leomessi,90,0.0124,Not_US,6800000,890,357300000,4400000,6000000000
3,4,selenagomez,93,0.0097,US,6200000,1800,342700000,3300000,11500000000
4,5,therock,91,0.002,US,1900000,6800,334100000,665300,12500000000


One more quick visual overview: we want to predict `influence_score`. Which values can this score take?

In [None]:
fig = px.histogram(df, x="influence_score", nbins=72)
fig.show()

We see that this score only takes positive integers, here from 22 to 93. A regression model whose predictions can range over all real numbers therefore makes little sense; a generalized linear model (GLM) whose predictions are always positive is a better fit. In our case we can model `influence_score` as a Poisson-distributed variable. Note that a Poisson GLM predicts the expected value of the score, which is strictly positive but not necessarily an integer.
One could also consider excluding all values below 70, since they are extreme outliers that could distort the training of the model.
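
To illustrate why a Poisson model suits a positive target, here is a small sketch on synthetic count data (the data, `X_toy`, `y_toy` and `toy_model` are made up for this example and are not the influencer data). Because `PoissonRegressor` uses a log link, its predictions stay strictly positive even for inputs far outside the training range.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Synthetic counts whose mean grows exponentially with a single feature.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(200, 1))
y_toy = rng.poisson(lam=np.exp(1.0 + 2.0 * X_toy[:, 0]))

toy_model = PoissonRegressor().fit(X_toy, y_toy)

# Query points inside and far outside the training range [0, 1].
X_new = np.array([[-5.0], [0.5], [5.0]])
toy_predictions = toy_model.predict(X_new)

# Predictions are exp(linear predictor): positive real numbers, not integers.
print(toy_predictions)
```

If one wanted to try the outlier filter mentioned above, a line such as `df = df[df['influence_score'] >= 70]` would do it; the threshold of 70 is the notebook's suggestion, not a tested choice.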

## Scikit-Learn

Defining the preprocessing steps

In [None]:
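# One-hot encode the categorical column; drop the first level to avoid a redundant dummy.
# `sparse_output` replaces the former `sparse` argument (scikit-learn >= 1.2).
dummy_maker = Pipeline(steps=[
    ('dummy_coder', OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore'))
])

# Rescale the numeric columns to the range [0, 1].
normalizer = Pipeline(steps=[
    ('standardizer', MinMaxScaler())
])

# Apply each pipeline to its columns; columns not listed here are dropped.
preprocessor = ColumnTransformer(
    transformers=[
        ('one_hot_encoder', dummy_maker, ['country']),
        ('scaler', normalizer, ['avg_likes', 'posts', 'followers', 'new_post_avg_like', 'total_likes'])
    ]
)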

Defining the regression models

In [None]:
GLM_poisson = PoissonRegressor()
boosting_poisson = HistGradientBoostingRegressor(loss='poisson')

Defining the hyperparameters

In [None]:
params_GLM = {'regressor__alpha' : np.linspace(0, 1, 10),
              'regressor' : [GLM_poisson]}

In [None]:
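# max_features is the fraction of features considered per split, in (0, 1];
# it requires scikit-learn >= 1.4.
params_boosting = {'regressor__max_depth' : range(2, 4),
                   'regressor__max_features' : [0.5, 1.0],
                   'regressor' : [boosting_poisson]}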

In [None]:
hyperparams = [params_GLM, params_boosting]

Applying the preprocessing and extracting the target variable

In [None]:
X = pd.DataFrame(preprocessor.fit_transform(df))
y = df['influence_score']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=486)
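
Fitting the preprocessor on the full DataFrame before splitting lets the scaler see the test rows. A common alternative, sketched here on made-up data (`toy`, `toy_train` and `toy_test` are hypothetical), is to split first and fit the scaler on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up data with one numeric feature and an integer target.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "followers": rng.uniform(1e6, 5e8, size=50),
    "influence_score": rng.integers(22, 94, size=50),
})

# Split the raw rows first ...
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=486)

scaler = MinMaxScaler()
# ... learn min and max from the training rows only ...
X_train_scaled = scaler.fit_transform(toy_train[["followers"]])
# ... and reuse them for the held-out rows (values may fall outside [0, 1]).
X_test_scaled = scaler.transform(toy_test[["followers"]])

print(X_train_scaled.shape, X_test_scaled.shape)
```

With only a min-max scaler and a two-level dummy the effect is probably small, but the habit matters more for transformers that learn more from the data.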

1. Define the model

In [None]:
model_1 = PoissonRegressor()


2. Train the model

In [None]:
model_1.fit(X_train, y_train)

PoissonRegressor()

3. Make predictions on the test set

In [None]:
predictions = pd.DataFrame(model_1.predict(X_test))

Predictions vs. actual values

In [None]:
# Align the indices so that actual values and predictions line up row by row.
test_df = pd.concat([y_test.reset_index(drop=True), predictions], axis=1).rename(columns={0: 'prediction'})

In [None]:
px.scatter(test_df, x='influence_score', y='prediction')
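
Besides the scatter plot, a numeric summary can help when comparing models. The sketch below computes the mean absolute error and the mean Poisson deviance with scikit-learn on made-up numbers (`y_true` and `y_pred` are illustrative values, not results from this notebook).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_poisson_deviance

# Made-up actual scores and predictions, only for illustration.
y_true = np.array([92, 91, 90, 93, 85])
y_pred = np.array([90.5, 89.0, 91.2, 88.7, 86.1])

# Average absolute difference, in units of the score.
mae = mean_absolute_error(y_true, y_pred)
# Deviance under a Poisson model; requires strictly positive predictions.
deviance = mean_poisson_deviance(y_true, y_pred)

print(f"MAE: {mae:.2f}")
print(f"Mean Poisson deviance: {deviance:.4f}")
```

With the notebook's variables, the equivalent calls would presumably be `mean_absolute_error(y_test, model_1.predict(X_test))` and `mean_poisson_deviance(y_test, model_1.predict(X_test))`.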