## 1st Class
> Date: April 02 (2020)

- Library used: sklearn
- **Problem:** Suppose you work for a video streaming platform and need to build a classifier that predicts which movie someone will want to watch, based on what that person has watched before.


In [1]:
import pandas as pd

uri = 'https://raw.githubusercontent.com/alura-cursos/machine-learning-algoritmos-nao-supervisionados/master/movies.csv'
movies = pd.read_csv(uri)

# If you prefer to work with pt-BR column names:
# movies.columns = ['filmes_id', 'Nome', 'Genero']
# If you do this, you will need to update every reference
# to these columns in the rest of the code

movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [2]:
genres = movies.genres.str.get_dummies() # One-hot encode the genres: one binary (0/1) column per genre
movies = pd.concat([movies, genres], axis = 1)
# movies = movies.drop(['genres'], axis = 1) # To remove the column 'genres'
movies.head()

Unnamed: 0,movieId,title,genres,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0,0,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children|Fantasy,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
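To make the effect of `str.get_dummies()` easier to see, here is a minimal sketch on a tiny, made-up table. The `toy` DataFrame and its three rows are hypothetical and are not taken from `movies.csv`; the only thing carried over from the cell above is the call itself, which splits each string on `|` (its default separator).

```python
import pandas as pd

# Hypothetical three-movie sample, NOT taken from movies.csv
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres': ['Comedy|Romance', 'Comedy', 'Drama|Romance'],
})

# Split each genres string on '|' (the default separator) and create
# one 0/1 column per distinct genre found in the data
toy_genres = toy['genres'].str.get_dummies()
print(toy_genres)
```

Each row gets a 1 in the columns of the genres it lists and a 0 everywhere else, which is the same layout as the wide table above.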


In [3]:
# Scaling the values
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_genres = scaler.fit_transform(genres)
scaled_genres.shape

(9742, 20)

In [4]:
scaled_genres

array([[-0.05917995, -0.48060698,  2.59101927, ..., -0.49125897,
        -0.20201972, -0.13206534],
       [-0.05917995, -0.48060698,  2.59101927, ..., -0.49125897,
        -0.20201972, -0.13206534],
       [-0.05917995, -0.48060698, -0.3859485 , ..., -0.49125897,
        -0.20201972, -0.13206534],
       ...,
       [-0.05917995, -0.48060698, -0.3859485 , ..., -0.49125897,
        -0.20201972, -0.13206534],
       [-0.05917995,  2.0807022 , -0.3859485 , ..., -0.49125897,
        -0.20201972, -0.13206534],
       [-0.05917995, -0.48060698, -0.3859485 , ..., -0.49125897,
        -0.20201972, -0.13206534]])
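The scaled values above are no longer 0 and 1 because `StandardScaler` subtracts each column's mean and divides by its standard deviation. The sketch below checks that on a small, made-up binary matrix `X` (an assumption for illustration, not the real genre data) by comparing the scaler's output with the manual formula.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical binary matrix: 4 samples, 2 one-hot style columns
X = np.array([[0, 1],
              [0, 0],
              [1, 0],
              [0, 0]], dtype=float)

scaled = StandardScaler().fit_transform(X)

# Manual z-score per column: (x - mean) / std, using the population
# standard deviation (ddof=0), which is what StandardScaler uses
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
```

A column where 1 is rare has a mean close to 0, so its 0 entries become small negative numbers and its 1 entries become large positive ones, which matches the pattern in the array above.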

## 2nd Class
> Date: April 12 (2020)

### Getting to know KMeans
- **Problem:** We now have a list of movies, each belonging to one or more genres. The classifier sees this information as a set of binary columns, one per genre, where each value indicates whether the movie belongs to the genre named in the column header.

In [61]:
from sklearn.cluster import KMeans
import numpy as np

SEED = 20
np.random.seed(SEED)

# n_clusters = number of groups (clusters) the samples will be split into
model = KMeans(n_clusters = 3) 
model.fit(scaled_genres)
print(f'Groups: {model.labels_}')
# Equivalent: print('Groups: {}'.format(model.labels_))

Groups: [0 0 1 ... 2 0 1]


### How KMeans works

1. Create one centroid for each cluster, according to the number of clusters defined before;

2. Assign each sample to its nearest centroid. This splits the space into n regions (n = number of clusters), and the boundary between two neighboring regions lies exactly halfway between their centroids;

3. Move each centroid to the center (the mean) of the samples assigned to it, the position where the distance to those samples is as short as possible;

4. Repeat steps 2 and 3 until the centroids no longer need to move.
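The four steps above can be sketched in plain NumPy. This is an illustrative sketch, not the implementation sklearn uses: the function name `kmeans_sketch`, its parameters, and the six-point dataset are all hypothetical. To keep the result deterministic, the starting centroids are passed in explicitly, whereas sklearn's `KMeans` picks them with the `k-means++` strategy by default.

```python
import numpy as np

def kmeans_sketch(X, init_centroids, max_iter=100):
    """Illustrative k-means loop (hypothetical helper, not sklearn's code)."""
    # Step 1: start from the given centroids, one per cluster
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned samples
        # (a centroid with no samples stays where it is)
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two well-separated groups of three points each
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.0],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# Start with one centroid taken from each group
labels, centroids = kmeans_sketch(X, init_centroids=X[[0, 3]])
print(labels)
print(centroids)
```

The starting centroids matter: a poor choice can leave the loop stuck in a worse grouping, which is one reason library implementations use smarter initialization and several restarts.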