***Unsupervised Learning***

## **Clustering**

Clustering is the task of dividing a population of data points into groups such that points in the same group are more similar to each other than to points in other groups. In short, it groups objects on the basis of the similarity and dissimilarity between them.

<img src = "https://media.geeksforgeeks.org/wp-content/uploads/merge3cluster.jpg">

<img src = "https://media.geeksforgeeks.org/wp-content/uploads/clusteringg.jpg">


## **K-Means Clustering**

K-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters. Each observation belongs to the cluster with the nearest mean (the cluster centroid), which serves as the prototype of that cluster.
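As a rough illustration of the "nearest mean" idea, here is a minimal pure-Python sketch of the usual iterative procedure: assign each point to its closest centroid, then move each centroid to the mean of its points. The function names (`kmeans`, `squared_distance`, `mean_point`), the toy 2-D points, and the choice of random initial centroids are assumptions made for this example; it is not how scikit-learn's `KMeans` is implemented internally.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, n_iter=100, seed=0):
    """Return (labels, centroids) after at most n_iter assign/update rounds."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point takes the label of its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centroids.append(mean_point(members) if members else centroids[c])
        if new_centroids == centroids:  # no centroid moved, so we have converged
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: three points near (1, 1) and three points near (8.5, 9).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

On this toy data the first three points end up sharing one label and the last three the other; which group is called 0 and which 1 depends on the random initialisation.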

In [46]:
#Importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import collections #For fetching dictionary of labels & clusters
import nltk #Natural Language Toolkit
nltk.download('stopwords')
nltk.download('punkt')
from nltk import word_tokenize #Word tokenization is the process of splitting a large sample of text into words.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer #Normalizing Sentences
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from pprint import pprint

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## **Loading a Dataset**

**Self-Made Dataset**

In [47]:
# My Dataset contains multiple sentences related to
# Graphics, Snooker, Investment, Software Engineering, Artificial Intelligence, Aviation & Love

sentences = pd.read_csv("/content/drive/My Drive/Datasets/Quotes.csv")
sentences

Unnamed: 0,Quotes
0,Graphics designers are most creative people
1,Artificial Intelligence or AI is the last inve...
2,Snooker is a billiards sport for normally two ...
3,Snooker is played on a large (12 feet by 6 fee...
4,FOREX is the stock market for trading currencies
5,Software Engineering is hotter and hotter topi...
6,Love is blind
7,Snooker is popular in the United Kingdom and m...
8,The flying or operating of aircraft is known a...
9,AI is likely to be either the best or worst th...


#### **Converting our dataframe into List**

In [48]:
sentences_list = sentences["Quotes"].tolist()

In [49]:
sentences_list

['Graphics designers are most creative people',
 'Artificial Intelligence or AI is the last invention - humans could ever make',
 'Snooker is a billiards sport for normally two players.',
 'Snooker is played on a large (12 feet by 6 feet) table that is covered with a smooth green material.',
 'FOREX is the stock market for trading currencies',
 'Software Engineering is hotter and hotter topic in Silicon Valley',
 'Love is blind',
 'Snooker is popular in the United Kingdom and many other countries',
 'The flying or operating of aircraft is known as aviation.',
 'AI is likely to be either the best or worst thing happen to humanity',
 'Design is Intelligence made visible.',
 'Falling in love is like being on drugs.',
 'There is only one happiness in Life to Love and to be loved.',
 "Boeing 777 is considered world's largest economical plane in the world of Aviation.",
 'Warren Buffet is famous for making good investments.He knows stock markets',
 'The biggest of the many uses of aviation a

## **Defining a function tokenizer(text)**

In [50]:
def tokenizer(text):
  tokens = word_tokenize(text) # Split the text into individual word tokens.
  stemmer = PorterStemmer()
  stop_words = set(stopwords.words('english')) # Build the stop-word set once, not once per token.

  # Drop stop words, then remove morphological affixes by stemming each remaining token.
  tokens = [stemmer.stem(t) for t in tokens if t not in stop_words]
  return tokens
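The three stages inside `tokenizer` (split into words, drop stop words, stem) can be traced on one sentence with a simplified stand-in that needs no NLTK downloads. Everything named here is hypothetical: `TOY_STOP_WORDS` is a tiny hand-picked list rather than NLTK's English stop-word list, and `toy_stem` is crude suffix stripping rather than the Porter algorithm, so the real `tokenizer` will generally return different tokens. The regular expression also discards punctuation, whereas `word_tokenize` keeps punctuation marks as separate tokens.

```python
import re

# Hypothetical, deliberately tiny stop-word list (NLTK's English list is much longer).
TOY_STOP_WORDS = {"is", "the", "a", "in", "and", "to", "of", "on"}

def toy_stem(word):
    """Crude suffix stripping; a stand-in for PorterStemmer, not equivalent to it."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_tokenizer(text):
    # 1. Split the lowercased text into alphanumeric word tokens.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # 2. Drop stop words, then 3. reduce each remaining word to a rough stem.
    return [toy_stem(t) for t in tokens if t not in TOY_STOP_WORDS]

print(toy_tokenizer("Falling in love is like being on drugs."))
# prints ['fall', 'love', 'like', 'being', 'drug']
```

The point of stemming is that variants such as "falling" and "fall" collapse to one feature, so sentences about the same topic share more terms.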


## **Defining a function cluster_sentences(sentences_list, k)**

#### **1. Creating the tf-idf Vectorizer Matrix**
#### **2. Training our K-Means Model**

In [51]:
def cluster_sentences(sentences_list, k):

  # Build the tf-idf vectorizer. Stop words are the common words we filter out (I, my, the, and, ...).
  # Note: tokenizer() already drops them, so stop_words here is redundant.
  tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenizer, stop_words=stopwords.words('english'), lowercase=True)

  # Build the tf-idf matrix for the sentences: each sentence becomes a
  # feature vector that can be used as input to an estimator.
  tfidf_matrix = tfidf_vectorizer.fit_transform(sentences_list)

  # Fit k-means with k clusters on the tf-idf vectors.
  kmeans = KMeans(n_clusters=k)
  kmeans.fit(tfidf_matrix)

  # Group sentence indices by the cluster label assigned to each sentence.
  clusters = collections.defaultdict(list)

  for i, label in enumerate(kmeans.labels_):
    clusters[label].append(i)

  return dict(clusters)
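To make the tf-idf matrix less of a black box, the sketch below computes tf-idf weights by hand for three already-tokenized toy documents. The helper `tfidf_rows` and the toy documents are hypothetical. The weighting used, raw term count times `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row, is what I understand scikit-learn's `TfidfVectorizer` to apply with its default settings; treat that correspondence as an assumption rather than a guarantee.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return (vocabulary, rows): one L2-normalised tf-idf vector per document."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    # Document frequency: the number of documents in which each term appears.
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocabulary, rows

# Toy, already-tokenized documents (hypothetical; not taken from Quotes.csv).
docs = [["love", "blind"], ["real", "love", "soul"], ["snooker", "sport"]]
vocabulary, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    weights = {t: round(w, 2) for t, w in zip(vocabulary, row) if w > 0}
    print(doc, weights)
```

In the first document "blind" receives a higher weight than "love", because "love" also occurs in the second document and is therefore less distinctive. K-means then clusters these row vectors, so sentences sharing distinctive terms land close together.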

## **Testing our Model**

In [54]:

k = 7
clusters = cluster_sentences(sentences_list,k)
for cluster in range(k):
  print("\nCLUSTER ",cluster,":\n")
  for i, sentence in enumerate(clusters[cluster]):
    print("\t",(i+1),": ",sentences_list[sentence])


CLUSTER  0 :

	 1 :  Software Engineering is hotter and hotter topic in Silicon Valley
	 2 :  All giant majors in Silicon Valley is focusing AI for their business productivity
	 3 :  Software Engineer has average salary of $170K at Google

CLUSTER  1 :

	 1 :  The flying or operating of aircraft is known as aviation.
	 2 :  AI is likely to be either the best or worst thing happen to humanity
	 3 :  Boeing 777 is considered world's largest economical plane in the world of Aviation.
	 4 :  The biggest of the many uses of aviation are in air travel and military aircraft.
	 5 :  Aviation refers to flying using an aircraft like an aeroplane.

CLUSTER  2 :

	 1 :  Love is blind
	 2 :  Falling in love is like being on drugs.
	 3 :  There is only one happiness in Life to Love and to be loved.
	 4 :  Being in love is the number one reason why people wed.
	 5 :  Loving from a long distance actually strengthens a relationship
	 6 :  Real love is able to awaken your soul.

CLUSTER  3 :

	 1 :  Gr

