<a href="https://colab.research.google.com/github/Mtlukasik/Exploration/blob/main/unity_ds_ui_quiz.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import libraries

- Import additional libraries of your choice.

Although you are expected to demonstrate understanding of ML/DS/statistics tools, a particular choice of libraries and frameworks will not affect evaluation of the solution.

In [1]:
from datetime import datetime
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sns
import keras
import sklearn
import hashlib
import matplotlib.pyplot as plt
import scipy.stats as stats
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from tensorflow.keras import Model
from keras.layers import Input, Dense, Reshape
# Import additional libraries of your choice

# Unity Data Science quiz

At Unity, we develop deep learning models for real-time ad bidding ([OpenRTB](https://www.iab.com/guidelines/openrtb/)) at various ad
exchanges. To bid for an ad impression, we estimate the optimal bid value using the predicted
install probability of a campaign together with several other factors, e.g. cost per install.

In this homework, your task is to **train a model to predict *install probabilities* for ad impressions included in
the test data**. In the production environment, the model predictions are used to derive the optimal bids for available ad campaigns. The best ad campaign will be shown to the user. Overestimating install probabilities leads to unnecessarily high bids and monetary losses, while underestimating them leads to unnecessarily low bids and lost opportunities for Unity to win ad impressions. Therefore, it is important for the model predictions to be as accurate as possible.
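
As a rough illustration of why both overestimation and underestimation hurt, the sketch below scores three constant predictors with log loss (binary cross-entropy). The toy labels and the choice of log loss are assumptions made for illustration only; the task description does not name an evaluation metric.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; lower is better."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Toy data, made up for illustration: 2 installs in 100 impressions
y_true = [1] * 2 + [0] * 98
base_rate = sum(y_true) / len(y_true)

calibrated = log_loss(y_true, [base_rate] * 100)   # predicts the true rate
overestimated = log_loss(y_true, [0.20] * 100)     # 10x too high
underestimated = log_loss(y_true, [0.002] * 100)   # 10x too low
print(calibrated, overestimated, underestimated)
```

On this toy data the predictor that outputs the true install rate gets the lowest loss, and errors in either direction are penalized.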

## Instructions

- Complete the homework using Python and libraries of your choice.
- Follow the instructions in this notebook.
- Keep the code clean and organized.

## Evaluation

We focus our evaluation on technical proficiency, analytical skills, problem solving, creative thinking as well as ability to communicate clearly. In particular, we will evaluate:
- understanding of the problem (e.g. does a delivered solution meet the specification of the task)
- quality of discussion and brevity of the report (e.g. comments in this notebook)
- quality of EDA
- feature handling & preprocessing
- modeling approach
- model validation and evaluation

We will separately evaluate the predicted test set install probabilities (see below). Although considered as part of the evaluation, the final performance is not the key factor and shall not dominate over the above dimensions.

## Data description

- ```id```: impression id
- ```timestamp```: time of the event in UTC (note: all installs happened long ago, so this game is probably old)
- ```campaignId```: id of the advertising campaign (the game being advertised)
- ```platform```: device platform
- ```softwareVersion```: OS version of the device
- ```country```: country of user
- ```sourceGameId```: id of the publishing game (the game being played)
- ```startCount```: how many times the user has started (any) campaigns
- ```viewCount```: how many times the user has viewed (any) campaigns
- ```clickCount```: how many times the user has clicked (any) campaigns
- ```installCount```: how many times the user has installed games from this ad network
- ```lastStart```: last time the user started any campaign
- ```startCount1d```: how many times the user has started (any) campaigns within the last 24 hours
- ```startCount7d```: how many times the user has started (any) campaigns within the last 7 days
- ```connectionType```: internet connection type
- ```deviceType```: device model
- ```install```: binary indicator if install was observed (install=1) or not (install=0) after impression

## Submission

- This Jupyter notebook
- A CSV file containing the predicted install probabilities of ad impressions in the test data. The file should have the following columns:
    - ```id```: ID of ad impression in the test data
    - ```install_proba```: Predicted install probability of ad impression
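
A minimal sketch of producing a file with these two columns is shown below. The helper name `write_submission`, the example ids and probabilities, and the comma delimiter are all assumptions for illustration; the task text does not specify a delimiter. In the notebook itself, `DataFrame.to_csv(path, index=False)` on a frame with the same two columns would serve the same purpose.

```python
import csv
import io

def write_submission(ids, probas, handle):
    """Write (id, install_proba) rows with a header to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "install_proba"])
    for impression_id, p in zip(ids, probas):
        # Guard against values that cannot be probabilities
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range for id {impression_id}: {p}")
        writer.writerow([impression_id, p])

# Hypothetical ids and probabilities, made up for illustration
buffer = io.StringIO()
write_submission(["a1", "b2"], [0.013, 0.207], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```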


# Analysis
by Mateusz Łukasik
 - start date: 2.05
 - submitted: 6.05

## Load and prepare data

In [2]:
from google.colab import drive
drive.mount('/content/drive')
train_df = pd.read_csv("/content/drive/MyDrive/training_data.csv", sep=";", parse_dates=True)
# Drop rows with a missing install label: imputing the target of a classifier serves no purpose
train_df = train_df[~train_df['install'].isna()]

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
train_df = train_df[~train_df['lastStart'].isna()]

In [4]:
train_df['lastStart'] = pd.to_datetime(train_df['lastStart'])

In [5]:
train_df = train_df.sort_values(by=['timestamp']).reset_index(drop=True)

In [6]:
train_df.columns

Index(['id', 'timestamp', 'campaignId', 'platform', 'softwareVersion',
       'sourceGameId', 'country', 'startCount', 'viewCount', 'clickCount',
       'installCount', 'lastStart', 'startCount1d', 'startCount7d',
       'connectionType', 'deviceType', 'install'],
      dtype='object')

In [7]:
# Ensure 'timestamp' is in datetime format
train_df['timestamp'] = pd.to_datetime(train_df['timestamp'])

# Group the DataFrame by calendar day
daily_groups = train_df.groupby(pd.Grouper(key='timestamp', freq='D'))

# Collect the non-empty per-day sub-DataFrames in chronological order
daily_dfs = [group for _, group in daily_groups if not group.empty]

In [8]:
# Print the shape of each daily sub-DataFrame
for day_df in daily_dfs:
    print(day_df.shape)

(315482, 17)
(313690, 17)
(306825, 17)
(281678, 17)
(232467, 17)
(225278, 17)
(189106, 17)
(201060, 17)
(214095, 17)
(211327, 17)
(223277, 17)
(257102, 17)
(272655, 17)
(219667, 17)

In [9]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
class DataPreprocessor:
    def __init__(self, numerical_columns, target_column):
        """
        Initialize the preprocessor with the list of numerical columns
        to keep and convert to floats, and the name of the target column for undersampling.
        """
        self.numerical_columns = numerical_columns
        self.target_column = target_column

    def preprocess(self, df):
        # Ensure the target column is included for processing
        columns = self.numerical_columns + [self.target_column]
        self.processed_df = df[columns].copy()

        # Convert numerical columns to float dtype
        for col in self.numerical_columns:
            self.processed_df[col] = pd.to_numeric(self.processed_df[col], errors='coerce').astype(float)

        # Perform undersampling
        self.df_majority = self.processed_df[self.processed_df[self.target_column] == 0]
        self.df_minority = self.processed_df[self.processed_df[self.target_column] != 0]
        self.len_ratio = self.df_majority.shape[0]/(self.df_majority.shape[0]+self.df_minority.shape[0])
        # Downsample the majority class
        df_majority_downsampled = resample(self.df_majority,
                                           replace=False,    # sample without replacement
                                           n_samples=len(self.df_minority),  # to match minority class size
                                           random_state=123) # reproducible results

        # Combine minority class with downsampled majority class
        balanced_df = pd.concat([self.df_minority, df_majority_downsampled])

        return balanced_df


class MyModel:
    def __init__(self, input_shape):
        self.model = self._build_model(input_shape)

    def _build_model(self, input_shape):
        # Define a simple Sequential model
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation='relu', input_shape=(input_shape,)),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(1, activation='sigmoid')  # Assuming binary classification
        ])
        model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
        return model

    def fit(self, X, y, epochs=10, validation_split=0.2):
        self.model.fit(X, y, epochs=epochs, validation_split=validation_split)

    def predict(self, X):
        return self.model.predict(X)

In [10]:
class Postprocessor:
    def __init__(self, model, adjusted_threshold):
        self.model = model
        self.adjusted_threshold = adjusted_threshold

    def predict(self, X):
        # Get the model's prediction probabilities
        predictions = self.model.predict(X)

        # Apply the adjusted threshold to these probabilities to get binary predictions
        binary_predictions = (predictions > self.adjusted_threshold).astype(int)
        return binary_predictions
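
Because the model is trained on an undersampled (balanced) set, its raw outputs are inflated relative to the true install rate. The notebook handles this with an adjusted threshold. An alternative, shown here only as a sketch and not as the method used above, is the standard prior correction that maps a probability from the balanced data back to the original class balance. The function name `correct_for_undersampling` and the example counts are hypothetical; `beta` is the fraction of `install=0` rows kept by the downsampling step.

```python
def correct_for_undersampling(p_sampled, beta):
    """Map a probability from a model trained on undersampled data
    back to the original class balance.

    beta: fraction of majority-class (install=0) rows kept when sampling.
    Undersampling multiplies the odds of install by 1/beta, so the
    correction multiplies them by beta again.
    """
    return beta * p_sampled / (beta * p_sampled + 1.0 - p_sampled)

# Made-up counts for illustration: 1,000 installs and 99,000 non-installs,
# with the majority class downsampled to the minority size
n_minority, n_majority = 1_000, 99_000
beta = n_minority / n_majority

# A score of 0.5 on balanced data corresponds to the base rate on the full data
p_neutral = correct_for_undersampling(0.5, beta)
print(p_neutral)
```

With these counts, a balanced-data score of 0.5 maps back to roughly the 1% base rate, which is the kind of value a bidding system would need rather than a hard 0/1 label.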


In [11]:
numerical_columns = ['startCount', 'viewCount', 'clickCount', 'installCount', 'startCount1d', 'startCount7d']
target_column = 'install'  # Assuming 'install' is your target column

# Initialize the preprocessor
preprocessor = DataPreprocessor(numerical_columns=numerical_columns, target_column=target_column)

# Preprocess your DataFrame
balanced_df = preprocessor.preprocess(daily_dfs[0])
X = balanced_df.drop(columns=[target_column])
y = balanced_df[target_column]
# Convert the target to a numeric type, if it's not already
y = pd.to_numeric(y, errors='coerce')
input_shape = X.shape[1]
# Initialize and train the model
my_model = MyModel(input_shape)
my_model.fit(X, y)
postprocessor = Postprocessor(my_model,preprocessor.len_ratio)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [12]:
# evaluation
B = postprocessor.predict(preprocessor.processed_df.drop(columns=[target_column]))
sum(B)/len(B)



array([0.01601042])

In [14]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Walk forward through the days: score each new day with the current model, then retrain
for day_idx, day_df in enumerate(daily_dfs):
    if day_idx == 0:
        continue  # day 0 was used to train the initial model
    print(day_idx)
    preprocessor_new = DataPreprocessor(numerical_columns=numerical_columns, target_column=target_column)
    preprocessor_new.preprocess(day_df)  # called for its side effect: fills processed_df
    y_true = preprocessor_new.processed_df[target_column]
    predictions_on_new_day = postprocessor.predict(preprocessor_new.processed_df.drop(columns=[target_column]))

    # Calculate metrics on the thresholded (binary) predictions; note that ROC-AUC
    # is therefore computed from hard labels, not from predicted probabilities
    accuracy_pre = accuracy_score(y_true, predictions_on_new_day)
    precision_pre = precision_score(y_true, predictions_on_new_day)
    recall_pre = recall_score(y_true, predictions_on_new_day)
    f1_pre = f1_score(y_true, predictions_on_new_day)
    roc_auc_pre = roc_auc_score(y_true, predictions_on_new_day)

    print(f"Accuracy: {accuracy_pre}")
    print(f"Precision: {precision_pre}")
    print(f"Recall: {recall_pre}")
    print(f"F1 Score: {f1_pre}")
    print(f"ROC-AUC Score: {roc_auc_pre}")

    # Add the rows predicted as installs to the previous training pool
    is_positive_prediction = predictions_on_new_day.flatten().astype(bool)
    selected_rows = preprocessor_new.processed_df[is_positive_prediction]
    train_df_updated = pd.concat([preprocessor.processed_df, selected_rows])
    print(f"New length for ith:{day_idx} iteration {train_df_updated.shape} now proceeding to train model again:")

    # Rebalance the updated pool and retrain the model from scratch
    preprocessor = DataPreprocessor(numerical_columns=numerical_columns, target_column=target_column)
    balanced_df = preprocessor.preprocess(train_df_updated)
    X = balanced_df.drop(columns=[target_column])
    # Convert the target to a numeric type, if it's not already
    y = pd.to_numeric(balanced_df[target_column], errors='coerce')
    input_shape = X.shape[1]
    my_model = MyModel(input_shape)
    my_model.fit(X, y)
    print(f"preprocessor.len_ratio: {preprocessor.len_ratio}")
    postprocessor = Postprocessor(my_model, preprocessor.len_ratio)

1
Accuracy: 0.9724281934393828
Precision: 0.008860011813349085
Recall: 0.012295081967213115
F1 Score: 0.010298661174047374
ROC-AUC Score: 0.4980289718128812
New length for ith:1 iteration (320561, 7) now proceeding to train model again:
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
preprocessor.len_ratio: 0.9890566849991109
2
Accuracy: 0.9868785138108042
Precision: 0.018456375838926176
Recall: 0.003186558516801854
F1 Score: 0.005434782608695652
ROC-AUC Score: 0.5006291196265287
New length for ith:2 iteration (321157, 7) now proceeding to train model again:
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
preprocessor.len_ratio: 0.989042742334746
3
Accuracy: 0.9879827320557516
Precision: 0.0411522633744856
Recall: 0.0031625553447185324
F1 Score: 0.005873715124816445
ROC-AUC Score: 0.5011629893154964
New length for ith:3 iteration (321400, 7) now proceeding to tr