# TalkingData challenge - Exploratory Data Analysis

The TalkingData challenge is a multi-class classification problem. The task is to predict the age and gender of mobile phone subscribers from their app usage profile.

First, we import standard libraries and load the training data.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(color_codes=True)

pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [None]:
file_paths = list(map(lambda x: "../input/" + x + ".csv", ["app_events", "app_labels", "events", "gender_age_train", "label_categories"]))
 
data_sets = list(map(pd.read_csv, file_paths))
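
Accessing the frames by position (`data_sets[0]`, `data_sets[3]`, ...) works but is easy to get wrong. As a minimal sketch of an alternative, the frames could be kept in a dictionary keyed by name. The helper `load_named` and the tiny in-memory CSV snippets below are hypothetical stand-ins, not part of the competition data.

```python
import io
import pandas as pd

def load_named(sources):
    """Read each CSV source into a DataFrame, keyed by a readable name."""
    return {name: pd.read_csv(src) for name, src in sources.items()}

# Hypothetical in-memory stand-ins for two of the competition files.
demo_sources = {
    "events": io.StringIO("event_id,device_id\n1,10\n2,10\n3,11\n"),
    "gender_age_train": io.StringIO(
        "device_id,gender,age,group\n10,M,35,M32-38\n11,F,24,F24-26\n"
    ),
}

demo_sets = load_named(demo_sources)

# Shapes can then be reported together with the name of each data set.
demo_shapes = {name: df.shape for name, df in demo_sets.items()}
print(demo_shapes)
```

With the real files, the dictionary values would be paths such as `"../input/events.csv"` instead of `StringIO` objects.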

Second, we check the sizes of the data sets. `app_events` has more than 32 million rows and `app_labels` about 460k. The actual training set, `gender_age_train`, has roughly 75k rows.

In [None]:
list(map(lambda x: x.shape, data_sets))

Now, we inspect the first few rows to get a feeling for the data set.

In [None]:
for data_set in data_sets:
    print(data_set.head())

## Basic characteristics

With the data loaded into pandas, let's start by visualizing a selection of simple characteristics.

### App statistics

About 20000 apps are used. Not surprisingly, the usage pattern is heavily skewed: on average, each app appears in roughly 1700 events, but the median is a mere 33 events.

In [None]:
apps = data_sets[0]['app_id']
print(apps.nunique())
value_count_app = apps.value_counts()
print(value_count_app.describe())
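
To see why a large gap between mean and median indicates skew, here is a small sketch on made-up data. The event log `toy_app_ids` is hypothetical: one app dominates and the rest appear only a handful of times.

```python
import pandas as pd

# Hypothetical event log: app 1 dominates, the other apps are rarely used.
toy_app_ids = pd.Series([1] * 90 + [2] * 5 + [3] * 3 + [4] * 1 + [5] * 1)

# Number of events per app, as in value_count_app above.
toy_counts = toy_app_ids.value_counts()

toy_mean = toy_counts.mean()      # pulled up by the single dominant app
toy_median = toy_counts.median()  # reflects the typical app
print(toy_mean, toy_median)
```

The mean is several times larger than the median because a single heavy hitter contributes most of the events, which is the same pattern seen in the real app counts.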

Let's visualize the distribution of less popular apps. 

In [None]:
threshold = 200
# `fill` replaces the deprecated `shade` argument in recent seaborn releases
unpop_apps = sns.kdeplot(value_count_app[value_count_app < threshold].values, fill = True)
unpop_apps.set_title('Density plot for apps of low popularity')

Let's also have a look at the other end of the spectrum. While the most popular app appears in almost 120k events, that number drops below 20k fairly quickly. 

In [None]:
BAR_COUNT = 100
plt.figure()
ax = value_count_app.iloc[:BAR_COUNT].plot.bar()
ax.axes.get_xaxis().set_visible(False)
ax.set_title('Most popular apps')

Out of curiosity, what are the labels associated with the hottest apps? No Pokémon Go, yet. We see instant messengers, a payment app and a fashion app.

In [None]:
pd.options.display.max_colwidth=80

label_cat = data_sets[4]
al = data_sets[1]

TOP_NUM = 5
top_apps = list(value_count_app.iloc[:TOP_NUM].index)
app_cat = pd.merge(al, label_cat, left_on = 'label_id', right_on = 'label_id').loc[:,['app_id','category']]
pd.concat([pd.DataFrame(app_cat[app_cat['app_id'] == app_id].groupby('app_id').aggregate(lambda x: tuple(x))['category'].values) 
           for app_id in top_apps])
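
The lookup above filters `app_cat` once per app. As a rough sketch of an alternative, the categories could be grouped once and then reordered to match the ranking. The helper name `categories_for` and the toy frame are assumptions made for illustration.

```python
import pandas as pd

def categories_for(app_ids, app_cat):
    """Return one tuple of category names per app, in the order of app_ids."""
    grouped = app_cat.groupby('app_id')['category'].agg(lambda s: tuple(s))
    # reindex keeps the ranking order; apps without labels become NaN
    return grouped.reindex(list(app_ids))

# Hypothetical app/category pairs standing in for the merged label table.
toy_app_cat = pd.DataFrame({
    'app_id': [1, 1, 2, 3],
    'category': ['IM', 'Finance', 'Fashion', 'Taxi'],
})

toy_result = categories_for([2, 1], toy_app_cat)
print(toy_result)
```

In the notebook this would be called as `categories_for(top_apps, app_cat)`.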

### Gender and age statistics

A quick look at the data reveals that the training set is substantially gender-imbalanced.

In [None]:
gender_age = data_sets[3]
gender_value_counts = gender_age['gender'].value_counts(normalize = True)
gender_plot = gender_value_counts.plot.barh()
gender_plot.set_title('Gender frequency')

On the other hand, the age distribution is similar for both genders.

In [None]:
BINS = range(0,80,2)
gender_age_pivot = gender_age.loc[:,['gender','age']].pivot(columns = 'gender', values = 'age')

plt.figure()

# density = True normalizes each histogram to unit area (formerly `normed`)
age_female = gender_age_pivot['F'].plot.hist(density = True, bins = BINS, alpha = 0.5, label = 'female')
age_female.set_title("Age distribution by gender")
age_male = gender_age_pivot['M'].plot.hist(density = True, bins = BINS, alpha = 0.5, label = 'male')

# Build the legend from the plot labels so its colors always match the histograms.
plt.legend()

plt.show()
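
A numeric summary can back up the visual impression that the two age distributions are similar. The following is a minimal sketch on a made-up frame; `toy_gender_age` is hypothetical and only mimics the `gender` and `age` columns of the training set.

```python
import pandas as pd

# Hypothetical miniature version of the gender_age table.
toy_gender_age = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'M', 'M'],
    'age': [20, 30, 22, 28, 40],
})

# Count, mean and median age for each gender.
toy_summary = toy_gender_age.groupby('gender')['age'].agg(['count', 'mean', 'median'])
print(toy_summary)
```

Applied to the real data, the same call on `gender_age` would show both the class imbalance (via `count`) and how close the central tendencies are.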

## App usage by age and gender

To develop an intuition for the features that could discriminate between genders and age groups, we investigate how the most popular apps differ across groups. To do so, we first select, for each device individually, the k most frequently used apps (k is `TOP_APP_NUM` in the code).

In [None]:
events = data_sets[2]
TOP_APP_NUM = 10

device_events = pd.merge(gender_age, events, left_on = 'device_id', right_on = 'device_id').loc[:,['device_id','event_id']]
device_events_apps = pd.merge(device_events, data_sets[0], left_on = 'event_id', right_on = 'event_id').loc[:,['device_id','app_id']]

In [None]:
most_used = []
most_used_aux = device_events_apps.groupby('device_id').agg(lambda x: list(x.value_counts().index[0:TOP_APP_NUM]))['app_id']
most_used_aux.reset_index().apply(lambda row: [most_used.append([row['device_id'], app]) for app in row['app_id']], 
                                  axis=1)
most_used = pd.DataFrame(most_used, columns = ['device_id', 'app_id'])
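
The per-device selection can also be written without appending to a list inside `apply`. The sketch below counts events per (device, app) pair, sorts, and keeps the first k rows of each device. The function name `top_apps_per_device` and the toy data are assumptions, and ties between equally frequent apps are not handled specially.

```python
import pandas as pd

def top_apps_per_device(device_apps, k):
    """Long-format frame with up to k most frequent apps for every device."""
    counts = (device_apps.groupby(['device_id', 'app_id'])
              .size()
              .reset_index(name='n'))
    # Most frequent apps first within each device.
    counts = counts.sort_values(['device_id', 'n'], ascending=[True, False])
    top = counts.groupby('device_id').head(k)
    return top[['device_id', 'app_id']].reset_index(drop=True)

# Hypothetical events: device 10 uses app 1 three times, app 2 twice, app 3 once.
toy_device_apps = pd.DataFrame({
    'device_id': [10, 10, 10, 10, 10, 10, 11, 11, 11],
    'app_id':    [1, 1, 1, 2, 2, 3, 4, 4, 5],
})

toy_most_used = top_apps_per_device(toy_device_apps, k=2)
print(toy_most_used)
```

The result has the same two columns as `most_used`, so it could be merged with `gender_age` on `device_id` in the same way.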

We merge this per-device app usage data frame back into the gender_age data and determine the most frequently used apps separately for men and women. There is substantial agreement, but the apps at ranks 7-10 differ. For instance, a photography app is at rank 9 for women, whereas rank 10 for men is a taxi app.

In [None]:
gender_age_top_apps = pd.merge(gender_age, most_used, on = 'device_id')

top_female_apps = gender_age_top_apps[gender_age_top_apps['gender'] == 'F'].loc[:, 'app_id'].value_counts().index[0:TOP_APP_NUM]
top_male_apps = gender_age_top_apps[gender_age_top_apps['gender'] == 'M'].loc[:, 'app_id'].value_counts().index[0:TOP_APP_NUM]

pd.concat([pd.DataFrame(app_cat[app_cat['app_id'] == app_id].groupby('app_id').aggregate(lambda x: tuple(x))['category'].values) 
           for app_id in top_female_apps])

In [None]:
pd.concat([pd.DataFrame(app_cat[app_cat['app_id'] == app_id].groupby('app_id').aggregate(lambda x: tuple(x))['category'].values) 
           for app_id in top_male_apps])
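
Instead of comparing the two tables by eye, the overlap between two rankings can be computed directly. This is a small sketch; `compare_rankings` and the toy lists are hypothetical and use category names in place of app ids for readability.

```python
def compare_rankings(first, second):
    """Split two ranked lists into shared items and items unique to each.

    Unique items are returned together with their 1-based rank.
    """
    first, second = list(first), list(second)
    shared = [app for app in first if app in second]
    only_first = [(rank, app) for rank, app in enumerate(first, start=1)
                  if app not in second]
    only_second = [(rank, app) for rank, app in enumerate(second, start=1)
                   if app not in first]
    return shared, only_first, only_second

# Hypothetical rankings for two groups.
toy_female = ['im', 'pay', 'photo']
toy_male = ['im', 'pay', 'taxi']

shared, only_female, only_male = compare_rankings(toy_female, toy_male)
print(shared, only_female, only_male)
```

In the notebook, the function could be called with `top_female_apps` and `top_male_apps`, or with the young and old rankings.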

Finally, let's take age into account. How do the most popular apps among young users differ from those favored by older users? Again, the top of the list agrees. Further down, a music app appears for the younger group, whereas a health care app appears for the older group.

In [None]:
gender_age_top_apps = pd.merge(gender_age, most_used, on = 'device_id')

top_young_apps = gender_age_top_apps[gender_age_top_apps['group'].isin(['M22-','F23-'])].loc[:, 'app_id'].value_counts().index[0:TOP_APP_NUM]
top_old_apps = gender_age_top_apps[gender_age_top_apps['group'].isin(['M39+','F43+'])].loc[:, 'app_id'].value_counts().index[0:TOP_APP_NUM]

pd.concat([pd.DataFrame(app_cat[app_cat['app_id'] == app_id].groupby('app_id').aggregate(lambda x: tuple(x))['category'].values) 
           for app_id in top_young_apps])

In [None]:
pd.concat([pd.DataFrame(app_cat[app_cat['app_id'] == app_id].groupby('app_id').aggregate(lambda x: tuple(x))['category'].values) 
           for app_id in top_old_apps])