<a href="https://colab.research.google.com/github/StephJones87/Natural_Language_Processing_Tasks/blob/master/001_conversation_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step-by-step conversation clustering
# Step 1 - using the Twitter API to get tweets

In [0]:
# import relevant modules


import tweepy
from tweepy import OAuthHandler
from tweepy import API

In [0]:
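# consumer_key, consumer_secret, access_token and access_token_secret are the
# credentials from a Twitter developer account; define them before running this cell
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)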

In [0]:
# scraping tweets

api = tweepy.API(auth, timeout=15)

all_tweets = []

search_query = 'asthma'

# Loop over tweepy's Cursor object to fetch tweets. The Cursor call takes the following arguments:
# - The first argument is the API method to page through. We want to search tweets, so we pass api.search
#   (renamed api.search_tweets in Tweepy 4.0 and later).
# - q is the search query. Appending " -filter:retweets" tells the search not to fetch retweets.
# - lang is set to 'en' since we only want English tweets.
# - result_type is set to 'recent' since we only need recent tweets.
# The items() method caps the number of tweets returned; here we ask for at most 2,000.
# Once the loop has run, all_tweets holds the text of up to 2,000 of the most recent tweets matching "asthma".

for tweet_object in tweepy.Cursor(api.search, q=search_query + " -filter:retweets", lang='en', result_type='recent').items(2000):
    all_tweets.append(tweet_object.text)

# Step 2 - turning the list of tweets into a pandas DataFrame, labelling the columns and assigning each tweet a unique ID

In [0]:
import pandas as pd

df = pd.DataFrame(all_tweets)

In [6]:
df.head()

Unnamed: 0,0
0,Proper cba with my asthma kicking off during a...
1,"In stigmatized city, calls grow to abolish ter..."
2,@Odhiambo_Issa @OmondiBetty @marcusolang It's ...
3,@lensawag1 @ambitioust2428 I heard that one ca...
4,10 Million Steps for Asthma https://t.co/Z2wET...


In [7]:
# checking the current column names before renaming them

print(df.columns)

RangeIndex(start=0, stop=1, step=1)


In [0]:
# renaming the single column on the dataframe (the unique ID is added in a later cell)

df.columns = ['raw_tweet']



In [9]:
df.head()

Unnamed: 0,raw_tweet
0,Proper cba with my asthma kicking off during a...
1,"In stigmatized city, calls grow to abolish ter..."
2,@Odhiambo_Issa @OmondiBetty @marcusolang It's ...
3,@lensawag1 @ambitioust2428 I heard that one ca...
4,10 Million Steps for Asthma https://t.co/Z2wET...


In [0]:
df['unique_ID'] = range(1, 1+len(df))

In [11]:
df.head()

Unnamed: 0,raw_tweet,unique_ID
0,Proper cba with my asthma kicking off during a...,1
1,"In stigmatized city, calls grow to abolish ter...",2
2,@Odhiambo_Issa @OmondiBetty @marcusolang It's ...,3
3,@lensawag1 @ambitioust2428 I heard that one ca...,4
4,10 Million Steps for Asthma https://t.co/Z2wET...,5


In [0]:
df = df[['unique_ID', 'raw_tweet']]

In [13]:
df.head()

Unnamed: 0,unique_ID,raw_tweet
0,1,Proper cba with my asthma kicking off during a...
1,2,"In stigmatized city, calls grow to abolish ter..."
2,3,@Odhiambo_Issa @OmondiBetty @marcusolang It's ...
3,4,@lensawag1 @ambitioust2428 I heard that one ca...
4,5,10 Million Steps for Asthma https://t.co/Z2wET...


In [14]:
# confirming the number of tweets 
df.shape

(2000, 2)
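The cells in Step 2 can also be condensed into a single block. The following is a minimal sketch rather than the notebook's own code: `sample_tweets` and `df_sketch` are hypothetical stand-ins for `all_tweets` and `df`, used so the example runs without Twitter credentials.

```python
import pandas as pd

# Stand-in for the scraped tweets; in the notebook this would be all_tweets.
sample_tweets = [
    "My asthma is kicking off again",
    "10 Million Steps for Asthma",
    "@someone I heard that one can outgrow asthma",
]

# Name the column at construction time instead of renaming it afterwards.
df_sketch = pd.DataFrame({"raw_tweet": sample_tweets})

# Insert a 1-based ID as the first column, so no reordering step is needed.
df_sketch.insert(0, "unique_ID", range(1, len(df_sketch) + 1))

print(df_sketch.columns.tolist())
print(df_sketch.shape)
```

Building the frame this way skips the intermediate `RangeIndex` column name and the separate column-reordering cell.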

Cleaning the tweets - for this project I have chosen to remove duplicate tweets and @mentions.
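The cells that follow handle duplicates but do not show the @mention removal. One way it could be done is sketched below, assuming a mention is an "@" followed by letters, digits or underscores. The helper name `strip_mentions` and the pattern are illustrative assumptions, not part of the original notebook.

```python
import re

# Assumed handle format: "@" followed by word characters.
# Adjust the pattern if the data contains other forms.
MENTION_PATTERN = re.compile(r"@\w+")

def strip_mentions(text):
    """Remove @mentions and collapse the whitespace they leave behind."""
    without_mentions = MENTION_PATTERN.sub("", text)
    return " ".join(without_mentions.split())

example = "@lensawag1 @ambitioust2428 I heard that one can outgrow asthma"
print(strip_mentions(example))
```

On the dataframe this could be applied with something like `df['raw_tweet'].apply(strip_mentions)`, storing the result in a new column so the raw text is preserved.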

In [0]:
df.drop_duplicates(subset=None, keep=False, inplace=True)

In [16]:
df.shape

(2000, 2)

In [0]:
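# it appears there are no duplicates, which I find strange. A likely cause: subset=None compares every column,
# including unique_ID, so no two rows can ever match. For now I will continue and come back to this later.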
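The effect of the `subset` and `keep` arguments of `drop_duplicates` can be checked on a toy frame. This is a sketch on made-up data (the `toy` frame is hypothetical), shown only to illustrate why a unique ID column can hide duplicate text.

```python
import pandas as pd

toy = pd.DataFrame({
    "unique_ID": [1, 2, 3],
    "raw_tweet": ["asthma is rough", "asthma is rough", "new inhaler today"],
})

# Comparing every column: the IDs differ, so no row counts as a duplicate.
all_columns = toy.drop_duplicates(subset=None, keep=False)

# Comparing only the text and keeping the first copy of each tweet.
text_only = toy.drop_duplicates(subset=["raw_tweet"], keep="first")

# keep=False on the text column removes every copy of a repeated tweet.
drop_all_copies = toy.drop_duplicates(subset=["raw_tweet"], keep=False)

print(len(all_columns), len(text_only), len(drop_all_copies))
```

Whether `keep="first"` or `keep=False` is the right choice depends on whether one copy of a repeated tweet should survive.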

In [0]:
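import spacy

# the 'en' shortcut was removed in spaCy 3; on newer versions load the
# small English model by its full name (assumes en_core_web_sm is installed)
nlp = spacy.load('en_core_web_sm')

# only the tokenizer is applied, so each tweet becomes a spaCy Doc of tokens
df['tokenized'] = df['raw_tweet'].apply(nlp.tokenizer)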

In [19]:
df.head()

Unnamed: 0,unique_ID,raw_tweet,tokenized
0,1,Proper cba with my asthma kicking off during a...,"(Proper, cba, with, my, asthma, kicking, off, ..."
1,2,"In stigmatized city, calls grow to abolish ter...","(In, stigmatized, city, ,, calls, grow, to, ab..."
2,3,@Odhiambo_Issa @OmondiBetty @marcusolang It's ...,"(@Odhiambo_Issa, @OmondiBetty, @marcusolang, I..."
3,4,@lensawag1 @ambitioust2428 I heard that one ca...,"(@lensawag1, @ambitioust2428, I, heard, that, ..."
4,5,10 Million Steps for Asthma https://t.co/Z2wET...,"(10, Million, Steps, for, Asthma, https://t.co..."


Once the tweets are tokenized, I need to turn them into a form the computer can work with, which in this case is a matrix of numbers. From here on each tweet is treated as a 'document'. A document can be any length, e.g. a sentence, a tweet, a paragraph or a long piece of text.

I will use scikit-learn's TF-IDF, which stands for Term Frequency-Inverse Document Frequency. It multiplies the number of times a word appears in a document (in my case, a tweet) by a weight that shrinks as the word appears in more documents across the corpus as a whole; in my example the corpus is the entire set of 2,000 tweets. The score aims to reflect how important a word is to a document. I expect words like 'the' to score low and words such as drug names to score higher, bringing the more important and useful words in each tweet to the forefront.

To do this I will use scikit-learn's TfidfVectorizer, which converts a collection of raw documents into a matrix of TF-IDF features. It tokenizes the raw text itself, so it is fitted on the raw_tweet column rather than on the spaCy tokens.
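To make the weighting concrete, here is a small sketch on three made-up documents. It recomputes one TF-IDF value by hand using scikit-learn's default smoothed formula, idf = ln((1 + n_docs) / (1 + doc_freq)) + 1 followed by L2 normalisation of each row, and compares it with the vectoriser's output. The documents and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the inhaler helped my asthma",
    "the weather made my asthma worse",
    "the inhaler ran out",
]

vectoriser = TfidfVectorizer()
tfidf = vectoriser.fit_transform(docs).toarray()
vocab = vectoriser.vocabulary_  # maps each term to its column index

# Manual computation for the first document.
n_docs = len(docs)
terms = docs[0].split()
counts = {t: terms.count(t) for t in set(terms)}
doc_freq = {t: sum(t in d.split() for d in docs) for t in counts}
raw = {
    t: counts[t] * (np.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
    for t in counts
}
norm = np.sqrt(sum(v ** 2 for v in raw.values()))  # L2 normalisation
manual_helped = raw["helped"] / norm

# The two numbers should agree.
print(manual_helped, tfidf[0, vocab["helped"]])

# "helped" occurs in one document, "the" in all three, so "helped" weighs more.
print(tfidf[0, vocab["helped"]] > tfidf[0, vocab["the"]])
```

The comparison at the end shows the intended behaviour: a word found in every document is down-weighted relative to a word that is specific to one.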

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

import numpy as np


In [0]:

vectoriser = TfidfVectorizer()
df['tweetsVect'] = list(vectoriser.fit_transform(df['raw_tweet']).toarray())


In [23]:
df.head()

Unnamed: 0,unique_ID,raw_tweet,tokenized,tweetsVect
0,1,Proper cba with my asthma kicking off during a...,"(Proper, cba, with, my, asthma, kicking, off, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,2,"In stigmatized city, calls grow to abolish ter...","(In, stigmatized, city, ,, calls, grow, to, ab...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,3,@Odhiambo_Issa @OmondiBetty @marcusolang It's ...,"(@Odhiambo_Issa, @OmondiBetty, @marcusolang, I...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,4,@lensawag1 @ambitioust2428 I heard that one ca...,"(@lensawag1, @ambitioust2428, I, heard, that, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,5,10 Million Steps for Asthma https://t.co/Z2wET...,"(10, Million, Steps, for, Asthma, https://t.co...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [26]:

# select a column as a Series and then convert it into a list
list_of_vectors = df['tweetsVect'].to_list()
print('List of Vectors: ', list_of_vectors)
print('Type of ListofVectors: ', type(list_of_vectors))


List of Vectors:  [array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0., 0.]), array([0., 0., 0., ..., 0., 0.,

In [27]:
len(list_of_vectors)

2000

In [29]:
# I want to have some idea of what these vectors look like - how long are they? 

len(list_of_vectors[0])

7877

In [0]:
# From this I think my dictionary/vocabulary is 7,877 distinct terms - every one of my 2,000 tweets is described by its weights for those 7,877 terms. Word order is not captured.
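Both points, that the vector length equals the vocabulary size and that word order is ignored, can be checked on a toy corpus. This is a sketch with made-up documents, not the tweet data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["dog bites man", "man bites dog", "asthma inhaler"]

vectoriser = TfidfVectorizer()
matrix = vectoriser.fit_transform(docs)

# One column per distinct term found across all documents.
print(matrix.shape)                 # (number of documents, vocabulary size)
print(len(vectoriser.vocabulary_))  # same as the number of columns

# The first two documents use the same words in a different order,
# so their TF-IDF vectors come out the same.
dense = matrix.toarray()
first, second = dense[0], dense[1]
print(abs(first - second).max())
```

`fit_transform` returns a sparse matrix; calling `.toarray()` as the notebook does is fine at this size, but keeping the sparse form uses far less memory on larger corpora.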

In [0]:
# So if you imagine a numerical matrix representing this data, it has 2,000 rows (one per tweet) and 7,877 columns (one per term).
# Ideally I want to compare the similarity of these vectors. How do I do this?
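One common answer is cosine similarity, which measures the angle between two vectors and ignores their length. A minimal NumPy sketch of the formula follows; the helper `cosine` and the example vectors are illustrative, and the function assumes neither vector is all zeros.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, different length
c = np.array([0.0, 0.0, 3.0])  # shares no non-zero component with a

print(cosine(a, b))  # close to 1: same direction
print(cosine(a, c))  # 0: nothing in common
```

Because TF-IDF weights are never negative, the cosine similarity between two tweets falls between 0 (no shared terms) and 1 (same direction).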

In [0]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(list_of_vectors, list_of_vectors)

In [35]:
cosine_sim

array([[1.        , 0.00780497, 0.00378426, ..., 0.01768924, 0.00539808,
        0.03553623],
       [0.00780497, 1.        , 0.0034042 , ..., 0.004977  , 0.00485594,
        0.0088391 ],
       [0.00378426, 0.0034042 , 1.        , ..., 0.00821289, 0.        ,
        0.01401312],
       ...,
       [0.01768924, 0.004977  , 0.00821289, ..., 1.        , 0.0460871 ,
        0.01336898],
       [0.00539808, 0.00485594, 0.        , ..., 0.0460871 , 1.        ,
        0.04618154],
       [0.03553623, 0.0088391 , 0.01401312, ..., 0.01336898, 0.04618154,
        1.        ]])

In [36]:
len(cosine_sim)

2000

In [0]:
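# I'm left with a 2,000 x 2,000 array of cosine similarities: entry [i, j] is the similarity between tweet i and tweet j,
# and the diagonal is 1 because every tweet is identical to itself. How do I use these scores to find the tweets that are most similar to each other?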
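One possible first step is to look up, for each tweet, the other tweet with the highest similarity score. The sketch below uses a small hand-written matrix (`sim`, a hypothetical stand-in for `cosine_sim`) so the result is easy to verify by eye. It is one option among several, not the notebook's chosen method.

```python
import numpy as np

# Small stand-in for the 2,000 x 2,000 cosine_sim array.
sim = np.array([
    [1.0, 0.1, 0.8],
    [0.1, 1.0, 0.3],
    [0.8, 0.3, 1.0],
])

# Ignore each tweet's similarity with itself (the diagonal of ones).
off_diagonal = sim.copy()
np.fill_diagonal(off_diagonal, -1.0)

# For every row, the column index of the most similar other tweet.
nearest = off_diagonal.argmax(axis=1)
print(nearest.tolist())

# The single most similar pair overall.
i, j = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)
print(int(i), int(j), off_diagonal[i, j])
```

The same similarity matrix could also feed a clustering algorithm that accepts precomputed distances, using `1 - cosine_sim` as the distance.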

