
Today, I am exploring the [TalkingData Mobile User Demographics](https://www.kaggle.com/c/talkingdata-mobile-user-demographics/data) dataset from Kaggle, published by **TalkingData**, a Chinese mobile data platform.




## 1. Introduction
---
Nothing is more comforting than being greeted by your favorite drink just as you walk through the door of the corner café. While a thoughtful barista knows you take a macchiato every Wednesday morning at 8:15, it’s much more difficult in a digital space for your preferred brands to personalize your experience.


TalkingData, China’s largest third-party mobile data platform, understands that everyday choices and behaviors reflect ourselves and what we value most.

TalkingData wants to leverage behavioral data from more than 70% of the 500 million mobile devices active daily in China (as of 2016) to help its clients better understand and interact with their audiences.

The data consists of app downloads and usage behaviors, as well as the age and gender of the user.








## 2. Datasets
---

In [1]:
#EDA Imports
from google.colab import drive 
drive.mount('/content/gdrive')
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/"
%cd "/content/gdrive/My Drive/Final Capstone"
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MultiLabelBinarizer



Mounted at /content/gdrive
/content/gdrive/My Drive/Final Capstone


In [2]:
# Reading each data file
app_events = pd.read_csv("app_events.csv")
app_labels = pd.read_csv("app_labels.csv")
events = pd.read_csv("events.csv")
gender_age_train = pd.read_csv("gender_age_train.csv")
gender_age_test = pd.read_csv("gender_age_test.csv")
category_labels = pd.read_csv("label_categories.csv")
phone_brand = pd.read_csv("phone_brand_device_model.csv")


In [3]:
app_labels.head()

| | app_id | label_id |
|---|---|---|
| 0 | 7324884708820027918 | 251 |
| 1 | -4494216993218550286 | 251 |
| 2 | 6058196446775239644 | 406 |
| 3 | 6058196446775239644 | 407 |
| 4 | 8694625920731541625 | 406 |


In [4]:
events.head()

| | event_id | device_id | timestamp | longitude | latitude |
|---|---|---|---|---|---|
| 0 | 1 | 29182687948017175 | 2016-05-01 00:55:25 | 121.38 | 31.24 |
| 1 | 2 | -6401643145415154744 | 2016-05-01 00:54:12 | 103.65 | 30.97 |
| 2 | 3 | -4833982096941402721 | 2016-05-01 00:08:05 | 106.6 | 29.7 |
| 3 | 4 | -6815121365017318426 | 2016-05-01 00:06:40 | 104.27 | 23.28 |
| 4 | 5 | -5373797595892518570 | 2016-05-01 00:07:18 | 115.88 | 28.66 |
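
The `events` preview shows 64-bit device IDs and a text timestamp. A minimal sketch of reading such a file with explicit dtypes and parsed dates follows. It uses two rows copied from the preview in an in-memory buffer (`sample_events` is a stand-in for `events.csv`), and the dtype choices are my assumption rather than something the dataset prescribes.

```python
import io
import pandas as pd

# Two rows copied from the events preview; the real file is events.csv.
sample_events = io.StringIO(
    "event_id,device_id,timestamp,longitude,latitude\n"
    "1,29182687948017175,2016-05-01 00:55:25,121.38,31.24\n"
    "2,-6401643145415154744,2016-05-01 00:54:12,103.65,30.97\n"
)

# Pin the ID columns to int64 so they are never silently read as floats,
# and parse the timestamp column into datetimes at load time.
events_sample = pd.read_csv(
    sample_events,
    dtype={"event_id": "int64", "device_id": "int64"},
    parse_dates=["timestamp"],
)

print(events_sample.dtypes)
print(events_sample["timestamp"].dt.hour.tolist())
```

With a parsed timestamp, features such as hour of day or day of week become one-line `.dt` lookups later on.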


In [6]:
category_labels.head()


| | label_id | category |
|---|---|---|
| 0 | 1 | NaN |
| 1 | 2 | game-game type |
| 2 | 3 | game-Game themes |
| 3 | 4 | game-Art Style |
| 4 | 5 | game-Leisure time |


In [7]:
gender_age_train.head()


| | device_id | gender | age | group |
|---|---|---|---|---|
| 0 | -8076087639492063270 | M | 35 | M32-38 |
| 1 | -2897161552818060146 | M | 35 | M32-38 |
| 2 | -8260683887967679142 | M | 35 | M32-38 |
| 3 | -4938849341048082022 | M | 30 | M29-31 |
| 4 | 245133531816851882 | M | 30 | M29-31 |
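
Because `group` combines gender and an age bracket, it is a natural single target for classification. The sketch below retypes the five previewed rows into a small frame and checks the class balance. On the full training table the same two calls would apply, though the proportions here say nothing about the real distribution.

```python
import pandas as pd

# The five rows shown in the preview, retyped as a small frame.
sample = pd.DataFrame({
    "device_id": [-8076087639492063270, -2897161552818060146,
                  -8260683887967679142, -4938849341048082022,
                  245133531816851882],
    "gender": ["M", "M", "M", "M", "M"],
    "age": [35, 35, 35, 30, 30],
    "group": ["M32-38", "M32-38", "M32-38", "M29-31", "M29-31"],
})

# Share of devices per target class; a heavily skewed result would argue
# for stratified splits and per-class metrics.
group_share = sample["group"].value_counts(normalize=True)
print(group_share)

# Sanity check: the first letter of `group` should match `gender`.
prefix_matches = (sample["group"].str[0] == sample["gender"]).all()
print(prefix_matches)
```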


In [8]:
phone_brand.head()


| | device_id | phone_brand | device_model |
|---|---|---|---|
| 0 | -8890648629457979026 | 小米 | 红米 |
| 1 | 1277779817574759137 | 小米 | MI 2 |
| 2 | 5137427614288105724 | 三星 | Galaxy S4 |
| 3 | 3669464369358936369 | SUGAR | 时尚手机 |
| 4 | -5019277647504317457 | 三星 | Galaxy Note 2 |


## 3. Question
---
The big question I plan to tackle today is:

### *Is it possible to predict a user's gender and age from details of their app and phone usage?*


## 4. Audience & Use Case
---
This information would be most beneficial to advertisers, because it can help them target ads at specific genders and age groups. In turn, this helps app developers and smartphone brands raise awareness of their apps and brands among their target demographics and focus on particular markets.




## 5. Methods
---
I will be using 8 CSV files from Kaggle to answer the question.
Each CSV file contains information ranging from age and gender to phone brand.

App labels and label categories are also included. I will merge some of the files to create larger datasets. The app label and label category files will be merged for ease of access. The gender and age data will also be merged in to simplify building the predictive models and clustering algorithms.
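
One way the merged label categories could feed a model is as a per-device indicator matrix, which is what the imported `MultiLabelBinarizer` produces. The sketch below is only an illustration under that assumption: the `merged` frame and its three categories are hypothetical, not taken from the real files.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical merged rows: one row per (device, app category) pair.
merged = pd.DataFrame({
    "device_id": [1, 1, 2, 2, 2, 3],
    "category": ["game", "finance", "game", "video", "game", "finance"],
})

# Collect the distinct categories seen on each device.
per_device = merged.groupby("device_id")["category"].apply(
    lambda s: sorted(set(s))
)

# One indicator column per category, one row per device.
mlb = MultiLabelBinarizer()
features = pd.DataFrame(
    mlb.fit_transform(per_device.tolist()),
    index=per_device.index,
    columns=mlb.classes_,
)
print(features)
```

Each row of `features` then describes one device, so it lines up with the one-row-per-device layout of the gender and age table.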

## Techniques applied in this project:
 1. Exploratory Data Analysis 
- Data Cleaning
- Merging originally split datasets

 2. Supervised Learning
- Create 6-7 classification models (using PCA and SelectKBest)
- Use grid search to determine the best hyperparameters
- Generate classification reports for each model
- Compare the models against each other and determine which classification model is the most accurate at predicting age and gender
- Determine which dimensionality reduction method was the most helpful when creating the models

Each model will predict Age and Gender from App User Information
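
The supervised steps listed above can be sketched as a single scikit-learn pipeline whose reduction step is swapped between `SelectKBest` and `PCA` inside the grid search. This is a sketch under stated assumptions: the data is synthetic (`make_classification`), `LogisticRegression` is a placeholder for whichever classifiers end up being compared, and the grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the device feature matrix (not the real data).
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=8,
    n_classes=4, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipe = Pipeline([
    ("reduce", SelectKBest(f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each dict is one family of candidates: the "reduce" step is swapped
# between SelectKBest and PCA so both are tuned on the same CV folds.
param_grid = [
    {"reduce": [SelectKBest(f_classif)], "reduce__k": [5, 10, 20],
     "clf__C": [0.1, 1.0]},
    {"reduce": [PCA(random_state=0)], "reduce__n_components": [5, 10, 20],
     "clf__C": [0.1, 1.0]},
]

search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_macro")
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
print(search.best_params_)
print(classification_report(y_test, y_pred))
```

Keeping the reduction step inside the pipeline means it is refit on each training fold, so the cross-validated scores are not inflated by information from the held-out fold.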

 3. Unsupervised Learning
- Two dimensionality reduction methods (PCA and t-SNE) will be applied ahead of 2-3 clustering algorithms (such as K-means and GMM)
- Use the clustered data to see which features are grouped together and how they impact my classification models
- Compare the clustering algorithms using silhouette scores
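
A minimal sketch of the comparison described above, assuming PCA as the reduction step and K-means and a Gaussian mixture as the two clusterers. The data is synthetic (`make_blobs`) and the choice of three clusters is arbitrary. t-SNE could be substituted for PCA in the same position.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the device features (not the real data).
X, _ = make_blobs(n_samples=300, n_features=10, centers=3, random_state=0)

# Reduce to two components before clustering.
X_2d = PCA(n_components=2, random_state=0).fit_transform(X)

labels = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d),
    "gmm": GaussianMixture(n_components=3, random_state=0).fit_predict(X_2d),
}

# Silhouette ranges from -1 to 1; higher means tighter, better separated
# clusters.
scores = {name: silhouette_score(X_2d, lab) for name, lab in labels.items()}
print(scores)
```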

 4. Deep Learning
- I will create a CNN (Convolutional Neural Network) that will also be used as a classification model
- Compare its precision and recall scores with those of the supervised learning models and determine which produced the best model


### Exploratory Data Analysis

Merging Datasets together

In [9]:
# Merging app_events and events through the event_id column
merged_events = pd.merge(app_events, events, on="event_id")


In [10]:
# Merging app_labels and category_labels through label_id
merged_apps = pd.merge(app_labels, category_labels, on="label_id")


In [None]:
# Merging previous merged data tables together to make larger data frame
merged_events_apps = pd.merge(merged_events, merged_apps, on="app_id")


In [None]:
# Adding phone brand to merged_events_apps through device_id merge
merged_events_apps_brands = pd.merge(merged_events_apps, phone_brand, on="device_id")


In [None]:
# Creating final data frame by merging gender and age with previous dataframe
final_df = pd.merge(merged_events_apps_brands, gender_age_train, on="device_id")
final_df.head()
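
The merges above use `pd.merge`'s default inner join, which silently drops any row without a match on the key. The sketch below shows one way to make that visible on a toy example. The two small frames are hypothetical, and using `how="left"` with `validate` and `indicator` is a suggestion of mine, not what the cells above do.

```python
import pandas as pd

# Hypothetical miniature versions of two of the tables.
events_small = pd.DataFrame({"event_id": [1, 2, 3], "device_id": [10, 10, 20]})
train_small = pd.DataFrame({"device_id": [10, 30], "group": ["M32-38", "F24-26"]})

# validate raises pandas.errors.MergeError if device_id is duplicated in
# train_small; indicator adds a `_merge` column showing where each row matched.
checked = pd.merge(
    events_small, train_small,
    on="device_id", how="left",
    validate="many_to_one", indicator=True,
)
print(checked["_merge"].value_counts())

# Rows an inner merge (the pandas default) would have dropped.
unmatched = int((checked["_merge"] == "left_only").sum())
print(f"{unmatched} of {len(checked)} events have no demographic label")
```

Checking the unmatched count after each step gives a quick sense of how much data the chain of merges discards before modeling.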


Translating phone brands to English

In [None]:
to_english = {
    "华为": "huawei",  # manually translated and entered
    "小米": "xiaomi",  # manually translated and entered
    "魅族": "meizu",  # manually translated and entered
    "vivo": "vivo",  # manually translated and entered
    "酷派": "coolpad",  # manually translated and entered
    "索尼": "sony",  # manually translated and entered
    "OPPO": "oppo",  # manually translated and entered
    "LG": "lg",  # manually translated and entered
    "HTC": "htc",  # manually translated and entered
    "金立": "gionee",  # manually translated and entered
    "中兴": "zte",  # manually translated and entered
    "奇酷": "qiku",  # manually translated and entered
    "TCL": "tcl",  # manually translated and entered
    "三星": "samsung",
    "天语": "Ktouch",
    "海信": "hisense",
    "联想": "lenovo",
    "欧比": "obi",
    "爱派尔": "ipair",
    "努比亚": "nubia",
    "优米": "youmi",
    "朵唯": "dowe",
    "黑米": "heymi",
    "锤子": "hammer",
    "酷比魔方": "koobee",
    "美图": "meitu",
    "尼比鲁": "nibilu",
    "一加": "oneplus",
    "优购": "yougo",
    "诺基亚": "nokia",
    "糖葫芦": "candy",
    "中国移动": "cmcc",
    "语信": "yuxin",
    "基伍": "kiwu",
    "青橙": "greeno",
    "华硕": "asus",
    "夏新": "amoi",
    "维图": "weitu",
    "艾优尼": "aiyouni",
    "摩托罗拉": "moto",
    "乡米": "xiangmi",
    "米奇": "micky",
    "大可乐": "bigcola",
    "沃普丰": "wpf",
    "神舟": "hasse",
    "摩乐": "mole",
    "飞秒": "fs",
    "米歌": "mige",
    "富可视": "fks",
    "德赛": "desci",
    "梦米": "mengmi",
    "乐视": "lshi",
    "小杨树": "smallt",
    "纽曼": "newman",
    "邦华": "banghua",
    "E派": "epai",
    "易派": "epai",
    "普耐尔": "pner",
    "欧新": "ouxin",
    "西米": "ximi",
    "海尔": "haier",
    "波导": "bodao",
    "糯米": "nuomi",
    "唯米": "weimi",
    "酷珀": "kupo",
    "谷歌": "google",
    "昂达": "ada",
    "聆韵": "lingyun",
}
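
A minimal sketch of applying such a mapping to the `phone_brand` column. It uses three entries copied from the dictionary and a hypothetical sample frame. The placeholder brand "未知品牌" and the `phone_brand_en` column name are assumptions for illustration.

```python
import pandas as pd

# Three entries copied from the dictionary above; the full dict would be
# used in practice.
to_english_sample = {"小米": "xiaomi", "三星": "samsung", "华为": "huawei"}

# Hypothetical sample; "未知品牌" stands in for a brand missing from the mapping.
brands = pd.DataFrame({"phone_brand": ["小米", "三星", "未知品牌"]})

# map() returns NaN for unmapped keys, so fall back to the original value
# rather than silently losing the brand.
brands["phone_brand_en"] = (
    brands["phone_brand"].map(to_english_sample).fillna(brands["phone_brand"])
)

# List any brands that still need a translation.
unmapped = brands.loc[
    ~brands["phone_brand"].isin(set(to_english_sample)), "phone_brand"
].unique().tolist()

print(brands)
print(unmapped)
```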
