# Clustering
## Given the text of teenagers' Social Networking Service (SNS) pages, identify groups that share common interests such as sports, religion, or music.
**The data include:**
   - 30,000 teenagers 
   - 4 variables indicating personal characteristics: gradyear, gender, age, and friends
   - 36 variables indicating interests (basketball, football, soccer, etc.)

## Setup

First, let's import a few common modules and make sure Matplotlib plots figures inline. This notebook assumes Python 3.5 or later (Python 2.x may work, but it is deprecated, so we strongly recommend Python 3) and Scikit-Learn ≥0.20.

In [None]:
# Common imports
import numpy as np
import pandas as pd

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Ignore useless warnings
import warnings
warnings.filterwarnings(action="ignore")

### 1. Import data and understand by statistical analysis

In [None]:
# load the CSV file as a dataframe
df = pd.read_csv("social_networking_data.csv")

In [None]:
df.head()

In [None]:
# Get the number of rows and columns
df.shape

In [None]:
# put the original column names in a python list
attr_names = list(df.columns.values)
attr_names

In [None]:
df['gradyear']

In [None]:
# Get the summary statistics of the dataset
df.describe()

### 2. Preprocessing the data - Data imputation

#### 2.1 Check for columns with missing/inappropriate values

In [None]:
# check for the presence of missing values in any of the columns
df.columns[df.isnull().any()]

In [None]:
# How many missing values in each column?
df.isnull().sum()

In [None]:
# Check for outliers - Use box plots
import seaborn as sns
#sns.boxplot(x=df['age'])

In [None]:
df['age'].plot.hist(
  bins = 100,
  title = "Histogram of the age variable"
)

In [None]:
df['age'].plot.box()

In [None]:
plt.figure(figsize=(15,8))
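# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
sns.histplot(df['age'].dropna(), bins=30, kde=True)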

In [None]:
# Count and percentage of missing values per column, largest first
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().mean() * 100).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
f, ax = plt.subplots(figsize=(15, 6))
plt.xticks(rotation=90)
sns.barplot(x=missing_data.index, y=missing_data['Percent'])
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
missing_data.head()

#### 2.2 Impute missing values

There are various methods:
- Using mean
- Using mode
- Using median
- Using conditional mean/median/mode
- Using backward/forward fill
- Using another ML algorithm, such as K-Nearest Neighbors

and so on.
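
The simpler strategies in the list above can be sketched on a toy column. The data below are made up for illustration and are not taken from the SNS dataset; only the `gradyear` and `age` column names mirror the real ones.

```python
import numpy as np
import pandas as pd

# Toy data (made up): ages with gaps, grouped by graduation year
toy = pd.DataFrame({
    "gradyear": [2006, 2006, 2006, 2007, 2007, 2007],
    "age": [18.0, np.nan, 19.0, 17.0, 17.0, np.nan],
})

# Single-value strategies: every gap gets the same number
mean_filled = toy["age"].fillna(toy["age"].mean())
median_filled = toy["age"].fillna(toy["age"].median())
mode_filled = toy["age"].fillna(toy["age"].mode().iloc[0])

# Conditional mean: each gap gets the mean age of its own graduation year
cond_mean_filled = toy["age"].fillna(
    toy.groupby("gradyear")["age"].transform("mean")
)

# Forward/backward fill: copy the previous/next observed value
ffilled = toy["age"].ffill()
bfilled = toy["age"].bfill()  # a trailing gap has no "next" value and stays NaN

print(pd.DataFrame({
    "mean": mean_filled,
    "cond_mean": cond_mean_filled,
    "ffill": ffilled,
    "bfill": bfilled,
}))
```

Forward and backward fill depend on row order, so they mostly make sense for ordered data such as time series.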

#### Impute Age
A reasonable age range for high school students includes those who are at least 13 years old
and not yet 20 years old. Any age value falling outside this range will be treated
the same as missing data, because the age provided cannot be trusted.

Keep only ages from 13 up to (but not including) 20, and treat all others as missing values (NA).

In [None]:
# Treat ages outside [13, 20) as missing; .loc avoids chained assignment
df.loc[(df['age'] < 13) | (df['age'] >= 20), 'age'] = np.nan

# Impute missing ages with the mean age of the same graduation year
df['age'] = df['age'].fillna(df.groupby('gradyear')['age'].transform('mean'))

#### Impute gender or remove rows with missing values
It is not a good idea to impute the missing values of gender using the mode, the median, or a similar single-value method. Instead, we use a more sophisticated method such as k-nearest neighbor imputation.
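
A minimal sketch of k-nearest neighbor imputation for gender, assuming the column holds the values `"F"` and `"M"` (not confirmed by the notebook) and that the listed feature columns are numeric. The helper `impute_gender_knn` and the toy columns `interest_a` and `interest_b` are hypothetical. `KNNImputer` requires Scikit-Learn 0.22 or later.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def impute_gender_knn(frame, feature_cols, n_neighbors=5):
    """Return a copy of `frame` with missing 'gender' filled by KNN (hypothetical helper)."""
    out = frame.copy()
    # Encode gender numerically; anything other than "F"/"M" (including NaN) becomes NaN
    out["gender_code"] = out["gender"].map({"F": 0.0, "M": 1.0})
    imputer = KNNImputer(n_neighbors=n_neighbors)
    filled = imputer.fit_transform(out[feature_cols + ["gender_code"]])
    # The imputed value is the neighbors' average code; round it to the nearer class.
    # An odd n_neighbors avoids exact ties at 0.5.
    codes = np.rint(filled[:, -1]).astype(int)
    out["gender"] = np.where(codes == 1, "M", "F")
    return out.drop(columns="gender_code")

# Toy data (made up): two well-separated groups, one missing label in each
toy = pd.DataFrame({
    "gender": ["F", "F", "F", np.nan, "M", "M", "M", np.nan],
    "interest_a": [0, 0, 1, 0, 5, 6, 5, 6],
    "interest_b": [5, 6, 5, 6, 0, 1, 0, 0],
})
result = impute_gender_knn(toy, ["interest_a", "interest_b"], n_neighbors=3)
print(result)
```

Because the neighbors are found by distance, features on very different scales may need to be standardized before imputing. Dropping the incomplete rows with `df.dropna(subset=['gender'])` is the simpler alternative named in the heading.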

#### 2.3 Remove unnecessary columns, which we should not use in clustering

In [None]:
df.columns
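
One possible way to carry out this step is sketched below. It assumes that only the interest counts are used as clustering features, so the four personal-characteristic columns are dropped; that choice is an assumption, not something the notebook states. The toy frame is made up.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "gradyear": [2006, 2007],
    "gender": ["F", "M"],
    "age": [18.9, 17.4],
    "friends": [7, 0],
    "basketball": [0, 2],
    "football": [1, 0],
})

# Assumption: cluster on interests only, so drop the personal characteristics
personal_cols = ["gradyear", "gender", "age", "friends"]
features = toy.drop(columns=personal_cols)  # returns a new frame; toy is unchanged

print(features.columns.tolist())
```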

#### 2.4 Normalize/standardize the data

Standardization of datasets is a common requirement for many machine learning algorithms implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.


In [None]:
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()

In [None]:
df["column_name"] = std_scaler.fit_transform(df["column_name"].values.reshape(-1,1))
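
`StandardScaler` replaces each value $x$ with $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the column's mean and population standard deviation. The sketch below scales several columns in one call on a made-up toy frame, which may be more convenient than reshaping one column at a time.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy interest counts (made up)
toy = pd.DataFrame({
    "basketball": [0, 1, 2, 5],
    "football": [10, 10, 30, 30],
})

scaler = StandardScaler()
# fit_transform accepts a 2-D frame, so all columns are scaled together;
# wrapping the result keeps the column names and index
scaled = pd.DataFrame(
    scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Each column now has mean ~0 and population standard deviation ~1
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

Note that `ddof=0` is needed when checking with pandas, because `StandardScaler` divides by the population standard deviation while `DataFrame.std` defaults to the sample one.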