# Exploratory Data Analysis with Python

This notebook uses a processed version of Kaggle's [Risk of being drawn into online sex work
dataset](https://www.kaggle.com/panoskostakos/online-sex-work).

This notebook explores the differences in a number of features/attributes
between the group of low-risk users and that of high-risk users in the dataset
through visualization. A graph visualization indicating the network of users
who are registered as friends in the online forum is also included.


## Loading data


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

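df = pd.read_csv('data/cleaned_online_sex_work.csv', index_col=0)
# Keep only the first 28,831 rows
df = df.iloc[:28831, :]
# Drop rows whose index value repeats, keeping the first occurrence
df = df[~df.index.duplicated(keep='first')]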
df.head()


In [None]:
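# Keep only labeled rows; copy so later column assignments do not touch df
train_df = df[df['Risk'].notnull()].copy()
train_df['Risk'] = train_df['Risk'].astype(int)
# Split into low-risk (Risk == 0) and high-risk (Risk != 0) groups
norisk_df = train_df[train_df['Risk'] == 0]
risk_df = train_df[train_df['Risk'] != 0]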

print(train_df.shape)


## Comparing low- and high-risk data points via bar charts


In [None]:
f, ax = plt.subplots(2, 2, figsize=(20, 10))

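# Gender counts, split by risk label
sns.countplot(x='Female', hue='Risk', data=train_df, ax=ax[0][0])

# Age distributions of the two risk groups, overlaid on the same axes
sns.distplot(norisk_df['Age'], kde_kws={'label': 'Low Risk'}, ax=ax[0][1])
sns.distplot(risk_df['Age'], kde_kws={'label': 'High Risk'}, ax=ax[0][1])
ax[0][1].legend()

# Location counts, split by risk label
sns.countplot(x='Location', hue='Risk', data=train_df, ax=ax[1][0])

# Verification counts, split by risk label
sns.countplot(x='Verification', hue='Risk', data=train_df, ax=ax[1][1])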

plt.show()


A few observations from these visualizations:
- There are many more low-risk than high-risk users, and more male than female
users.
- Most users are in their late 20s to mid 60s, and the age distributions of
low-risk and high-risk users show no clear difference.
- There are more low-risk than high-risk users in most registered locations,
except for locations M and N.
- Most users are not verified in the forum.
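
The proportions described above can also be checked numerically instead of being read off the charts. The following is a minimal sketch on a small made-up frame: `toy_df` and its values are invented for illustration and are not the real dataset, although the `Female` and `Risk` column names match the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for train_df; the values are invented.
toy_df = pd.DataFrame({
    'Female': [0, 0, 0, 0, 1, 1, 0, 0],
    'Risk':   [0, 0, 1, 0, 0, 1, 0, 0],
})

# Share of each risk class across all rows
risk_share = toy_df['Risk'].value_counts(normalize=True)

# Share of each risk class within each gender (each row sums to 1)
risk_by_gender = pd.crosstab(toy_df['Female'], toy_df['Risk'], normalize='index')

print(risk_share)
print(risk_by_gender)
```

Running the same two calls on `train_df` would give the class balance overall and per gender.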

Possible features to explore: orientation, polarity, looking-for, etc.


## Exploring distribution of numerical attributes


In [None]:
f, ax = plt.subplots(1, 2, figsize=(15, 5))
# Points_Rank distribution for each risk group, one panel per group
sns.distplot(norisk_df['Points_Rank'], kde_kws={'label': 'Low Risk'}, ax=ax[0])
sns.distplot(risk_df['Points_Rank'], kde_kws={'label': 'High Risk'}, ax=ax[1])
ax[0].legend()
ax[1].legend()

plt.show()


Possible features to explore: online activity statistics, number of friends


## Feature correlation matrix


In [None]:
corr_matrix = train_df.drop(['Friends_ID_list'], axis=1).corr()

f, ax = plt.subplots(1, 1, figsize=(15, 10))
sns.heatmap(corr_matrix)

plt.show()
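
A dense heatmap can be hard to read, so it may help to also list the features ranked by how strongly they correlate with `Risk`. The sketch below uses randomly generated stand-in data: `toy_df` and the `feature_*` column names are placeholders, so the printed numbers carry no meaning for the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for train_df; only the column layout matters here.
toy_df = pd.DataFrame(
    rng.normal(size=(200, 4)),
    columns=['feature_a', 'feature_b', 'feature_c', 'feature_d'],
)
# Make the synthetic label depend on feature_a so there is something to find.
toy_df['Risk'] = (toy_df['feature_a'] + rng.normal(scale=0.5, size=200) > 0).astype(int)

corr_matrix = toy_df.corr()

# Correlation of every feature with the target, ordered by absolute strength
risk_corr = corr_matrix['Risk'].drop('Risk')
risk_corr = risk_corr.reindex(risk_corr.abs().sort_values(ascending=False).index)

print(risk_corr)
```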


## Insights from machine learning


In [None]:
from sklearn.svm import LinearSVC


# Encode Location as the mean Risk of the users registered at each location
location_means = train_df.groupby('Location')['Risk'].mean()
train_df['Location'] = train_df['Location'].map(location_means)
X_train = train_df.drop(['Friends_ID_list', 'Risk'], axis=1)
y_train = train_df['Risk']

clf = LinearSVC()
clf.fit(X_train, y_train)

nfeatures = 10

coef = clf.coef_.ravel()
top_positive_coefs = np.argsort(coef)[-nfeatures :]
top_negative_coefs = np.argsort(coef)[: nfeatures]
top_coefs = np.hstack([top_negative_coefs, top_positive_coefs])

plt.figure(figsize=(15, 5))
# Red bars for negative coefficients, blue bars for positive ones
colors = ['red' if c < 0 else 'blue' for c in coef[top_coefs]]
plt.bar(np.arange(2 * nfeatures), coef[top_coefs], color=colors)
feature_names = np.array(X_train.columns)
# One tick per bar, so the tick positions and labels have the same length
plt.xticks(np.arange(2 * nfeatures), feature_names[top_coefs], rotation=60, ha='right')

plt.show()


No feature has a large absolute coefficient, so no single feature appears to be
decisive in determining a user's risk. However, features that reflect online
activity, such as `Number_of_Friends`, `Member_since_day`, and
`Number of Comments in public forum`, all have negative coefficients, which
suggests that higher online activity may be associated with lower risk.

Other observations:
- `Female` has a near-zero coefficient, which suggests that gender plays little
role in the model's estimate of a user's risk.
- `Submissive` has the most positive coefficient, although its magnitude is
still small.
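
One caveat when reading these coefficients: in a linear model, coefficient magnitudes depend on the scale of each feature, so comparing them is most meaningful when the features share a common scale. The sketch below is not part of the original analysis. It uses synthetic data with hypothetical column names (`small_scale`, `large_scale`) to show one possible way to standardize features before fitting `LinearSVC`.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic features on very different scales (hypothetical names)
X = pd.DataFrame({
    'small_scale': rng.normal(0, 1, 300),
    'large_scale': rng.normal(0, 1000, 300),
})
# Both features contribute equally to the label once scale is accounted for
y = ((X['small_scale'] + X['large_scale'] / 1000) > 0).astype(int)

# Standardize each feature, then fit the linear SVM on the scaled values
model = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
model.fit(X, y)

# Coefficients now refer to standardized features and can be compared directly
coef = pd.Series(model.named_steps['linearsvc'].coef_.ravel(), index=X.columns)
print(coef.sort_values())
```

Applied to `X_train`, the same pipeline would put all columns on one scale before the top coefficients are ranked.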

