# The problem 

We want to predict whether a given patient will show up for their appointment.
This is a binary classification problem.

## Import the dataset

In [None]:
import pandas as pd

df = pd.read_csv('KaggleV2-May-2016.csv')

Now let's take a look at the data.

In [None]:
df.head()

In [None]:
df.describe()

The dataset has 14 columns and 110527 rows.
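
By default `describe()` summarizes only the numeric columns, so the row and column counts are easier to read from `df.shape`. A minimal sketch on a small stand-in frame (the `toy` data is made up for illustration and is not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the real dataset
toy = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Age": [62, 56, 8],
    "No-show": ["No", "No", "Yes"],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape
print(f"{n_cols} columns, {n_rows} rows")
```

On the real data the same check would be `df.shape`.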

In [None]:
df.duplicated().sum()

We have no duplicated rows.
Now let's fix the misspelled column names, normalize all names to lowercase with underscores, and check for missing or invalid values.

In [None]:
df.rename(columns={'Hipertension':'Hypertension', 'Handcap':'Handicap'}, inplace=True)
df.columns = df.columns.str.lower().str.replace('-', '_')

df.head()
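
A missing-value check could look like the sketch below. It runs on a small made-up frame (`toy` and its values are assumptions for illustration); on the real data the same call would be `df.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few deliberately missing entries
toy = pd.DataFrame({
    "age": [62, np.nan, 8],
    "gender": ["F", "M", None],
    "scholarship": [0, 0, 1],
})

# Count missing values per column; all zeros would mean nothing is missing
missing_per_column = toy.isna().sum()
print(missing_per_column)
```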

In [None]:
df.query('age < 0')

In [None]:
df.drop(df.query('age < 0').index, inplace=True)

df['appointmentid'].count()

The data is now clean.

# The analysis


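Let's prepare the data: drop the identifier columns, encode the target as 0/1, label-encode every feature, and scale the features to the [0, 1] range.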

In [None]:
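from sklearn.preprocessing import MinMaxScaler, LabelEncoder

# Features: everything except the target and the two identifier columns
X = df.drop(['no_show', 'appointmentid', 'patientid'], axis=1)

# Target: 1 if the patient missed the appointment, 0 otherwise
y = df['no_show'].map({'No': 0, 'Yes': 1})

# Label-encode every column (the encoder is refit on each column)
encoder = LabelEncoder()
features = X.columns
X = X.apply(encoder.fit_transform)

# Scale all features to the [0, 1] range
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

# Rebuild the DataFrame, keeping the original index so X stays aligned with y
X = pd.DataFrame(X, columns=features, index=y.index)
X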
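
To make the two preprocessing steps concrete, here is a small sketch on made-up data (the `toy` frame is an assumption for illustration). `LabelEncoder` maps the distinct values of a column to integer codes in sorted order, and `MinMaxScaler` then rescales each column to [0, 1].

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Made-up rows for illustration only
toy = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "age": [62, 8, 35],
})

# Encode every column to integer codes (sorted order of distinct values)
encoded = toy.apply(lambda col: LabelEncoder().fit_transform(col))

# Rescale each column to the [0, 1] range
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(encoded),
    columns=toy.columns,
    index=toy.index,
)
print(scaled)
```

Note that label-encoding a numeric column such as age keeps the order of the values but not the spacing between them.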

In [None]:
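import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the features onto 2 dimensions for visualization
tsne = TSNE(n_components=2, random_state=42)
X_reduced = tsne.fit_transform(X)

# Color each point by its class (0 = showed up, 1 = no-show)
plt.figure(figsize=(13,10))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap="jet")
plt.show()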
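
Running t-SNE on all of the roughly 110,000 rows can take a long time. One possible shortcut, sketched here on synthetic data, is to embed a random subsample instead. The names `X_demo` and `y_demo` and the sample size of 200 are assumptions chosen for illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)

# Synthetic stand-in for the scaled feature matrix and the 0/1 target
X_demo = rng.random((1000, 5))
y_demo = rng.integers(0, 2, size=1000)

# Draw a random subsample without replacement (size is an arbitrary choice)
sample_size = 200
idx = rng.choice(len(X_demo), size=sample_size, replace=False)

# Embed only the sampled rows, and keep the matching labels for coloring
tsne = TSNE(n_components=2, random_state=42)
X_demo_reduced = tsne.fit_transform(X_demo[idx])
y_demo_sample = y_demo[idx]

print(X_demo_reduced.shape)
```

The reduced array and the sampled labels can then be passed to the same `plt.scatter` call as above. A subsample gives only an approximate picture of the full data, so the plot should be read with that in mind.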