**LSE Data Science Institute | DS105M 2022 Week10**

**Topic:** Unstructured Data

**Author:** [@jonjoncardoso](github.com/jonjoncardoso)

**Date:** 29 November 2022

---

Note: If you did not attend the lecture, you may notice a few gaps in your understanding when following this notebook. If so, watch the lecture recording.

# Why care about unstructured data?

Most datasets do not come in a tidy format that fits neatly into a data frame (a structured data format). **Text data** is one example.

# Working with Twitter Data

We will use the [tweepy](https://docs.tweepy.org/en/stable/getting_started.html) library to access the Twitter API.

The first thing we have to do is authenticate:

## 🚨🚨🚨🚨 KEEP SECRETS OUT! 🚨🚨🚨🚨

EXTREMELY EXTREMELY IMPORTANT ADVICE:

- Don't paste your API keys or tokens ANYWHERE in this notebook. Also, don't put them on GitHub!!!
- Instead, create a `config.py` file somewhere outside this project (or add it to your `.gitignore`). See, for example, [this Stack Overflow answer](https://stackoverflow.com/a/25501861/843365).

In [1]:
import config # This loads the content of the config.py file. If this throws an error, it is because you haven't created a config.py!
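
One possible layout for such a `config.py` is sketched below. It is only an illustration, not an official course file: it assumes the credentials live in environment variables, and both the environment variable names (`TWITTER_BEARER_TOKEN` and so on) and the helper `read_secret` are made up for this example. The attribute names match the ones this notebook reads from `config`.

```python
# config.py -- a hypothetical layout; keep this file out of version control
import os


def read_secret(name, default=""):
    """Return the environment variable `name`, or `default` if it is not set."""
    return os.environ.get(name, default)


# The environment variable names below are assumptions made for this sketch.
bearer_token = read_secret("TWITTER_BEARER_TOKEN")
api_key = read_secret("TWITTER_API_KEY")
api_key_secret = read_secret("TWITTER_API_KEY_SECRET")
access_token = read_secret("TWITTER_ACCESS_TOKEN")
access_token_secret = read_secret("TWITTER_ACCESS_TOKEN_SECRET")
```

Because the secrets never appear in the file itself, committing it by accident would be less harmful, although keeping it out of the repository is still the safer habit.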

## Establish a connection

In [3]:
import tweepy

client = tweepy.Client(bearer_token=config.bearer_token, 
                       consumer_key=config.api_key, 
                       consumer_secret=config.api_key_secret, 
                       access_token=config.access_token, 
                       access_token_secret=config.access_token_secret)

## Obtain a few tweets just to test this works:

In [3]:
public_tweets = client.search_recent_tweets(query="Qatar")

# What is the format of the data returned?
public_tweets

Response(data=[<Tweet id=1599775387872919552 text='RT @tendenciasposta: "#Mbappé🏳️\u200d🌈":\nPorque fue deportado de Qatar junto a Giroud. https://t.co/BxfeqKmfJI'>, <Tweet id=1599775387797463041 text='RT @sk_bongomin93: Qatar put their whole in crafting these world cup stadia, it could turn into a night club anytime, the world was jamming…'>, <Tweet id=1599775386975424512 text='RT @solopasaenperu: "No estoy en Qatar, estoy en el mundial" https://t.co/0MIcm7GJXm'>, <Tweet id=1599775386933813249 text='The @NYCMayor is in #Qatar for the #WorldcupQatar2022 while the city continues to go to shit. #priorities'>, <Tweet id=1599775385490972673 text='🇲🇽 El ‘Tata’ Martino regresa a #México entre abucheos e insultos\n\n❌ Los hinchas del #Tri y reporteros no perdonaron la eliminación del seleccionado mexicano en el #Mundial de #Qatar \n\nhttps://t.co/u3asQbRxEz'>, <Tweet id=1599775383712591873 text='RT @FreeLiveLink: 🔴𝐆𝐨 𝐎𝐧 𝐋𝐢𝐯𝐞📺@worldcuplive

**What kind of data is returned?**

💡 From [tweepy's Response documentation](https://docs.tweepy.org/en/stable/response.html#tweepy.Response), we learn that this object is of a particular data type called a [named tuple](https://realpython.com/python-namedtuple/#using-namedtuple-to-write-pythonic-code). It behaves a bit like a dictionary: it has named fields that hold data:

In [4]:
type(public_tweets)

tweepy.client.Response

In [15]:
public_tweets._fields

('data', 'includes', 'errors', 'meta')
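
To get a feel for named tuples without calling the API, the sketch below builds a stand-in with the same four field names shown above. The content is made up and this is not tweepy's own class; it only illustrates how the fields of a named tuple can be accessed.

```python
from collections import namedtuple

# Stand-in with the same four field names that `public_tweets._fields` shows.
Response = namedtuple("Response", ("data", "includes", "errors", "meta"))

# Made-up content, only to show how fields are accessed.
response = Response(
    data=["first tweet", "second tweet"],
    includes={},
    errors=[],
    meta={"result_count": 2},
)

print(response.data)               # access a field by name...
print(response[0])                 # ...or by position, like a regular tuple
print(response._fields)            # the field names
print(response._asdict()["meta"])  # convert to a dictionary when needed
```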

In [16]:
public_tweets.data

[<Tweet id=1597576494720888837 text='“Não pode isso e aquilo no Qatar…”\nMas eles não contavam q o brasileiro tem um coisa mto apaixonante: a alegria! https://t.co/J1T1b5l2Xg'>,
 <Tweet id=1597576494628208640 text='Netherlands wins \nNetherlands 2 - 1 Qatar https://t.co/DUtWhDfIBa'>,
 <Tweet id=1597576494553133056 text='RT @PMU_Sport: 🔮 Les prédictions de @ReveilAudrey &amp; de @NeauMali pour Pays-Bas // Qatar !\n\n🤔 Vous êtes #TeamPaysBas ou #TeamQatar ?\n🔃 RT +…'>,
 <Tweet id=1597576494511173632 text='RT @VTVcanal8: #Qatar2022⚽| Disfruta con nosotros todos los encuentros, análisis, estadísticas, mejores jugadas y mucho más con los mejores…'>,
 <Tweet id=1597576494112727040 text="Vraiment a mes yeux les matchs comme ça y a pas + inutile 😭 On sait tous très bien que le Qatar va ce faire ouvrir en 2 mais on s'emmerde a faire un match 😭 https://t.co/QV4xN7cYab">,
 <Tweet id=1597576493810475009 text='RT @cfootcameroun: «\xa0Je n’ai pas de problème a

In [21]:
public_tweets.meta

{'newest_id': '1597576494720888837',
 'oldest_id': '1597576492514410497',
 'result_count': 10,
 'next_token': 'b26v89c19zqg8o3fpzhm60iol0r4ixtrpxfityowxs1h9'}

**Interesting...**

🤔 Hmmm, so we learn that our query returned only 10 results and, from the documentation, we read that `next_token` is used to paginate. This applies to all APIs: when in doubt, check the documentation!

Let's look at the next page then:

In [25]:
client.search_recent_tweets(query="Qatar", next_token='b26v89c19zqg8o3fpzhm60iol0r4ixtrpxfityowxs1h9')

Response(data=[<Tweet id=1597576491247624192 text="Migrant workers were deceived and died for Qatar's World Cup. Thousands want compensation https://t.co/8hwISaR45a">, <Tweet id=1597576490723704832 text='RT @binnahar85: Criticising Qatar this way or another is not an issue, it is actually a basic right.\nThe issue here is hypocrisy. Did u do…'>, <Tweet id=1597576490065219585 text='RT @PMU_Sport: 🔮 Les prédictions de @ReveilAudrey &amp; de @NeauMali pour Pays-Bas // Qatar !\n\n🤔 Vous êtes #TeamPaysBas ou #TeamQatar ?\n🔃 RT +…'>, <Tweet id=1597576490061033472 text='RT @_mydeszn: I’m live in Qatar. 🇶🇦❤️✨, and yes I’m actually a white Nigerian https://t.co/pH5WPgtqqs'>, <Tweet id=1597576488987291648 text='RT @HananyaNaftali: The best joke of the World Cup:\n\nQatar wants to lecture Israel on human rights. 😂 #WorldCup2022 #FIFA https://t.co/3pb4…'>, <Tweet id=1597576487745781760 text='RT @RusEmbIran: Moscow has supported the national team of 🇮🇷Iran 
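
Following `next_token` by hand quickly becomes tedious, so it is usually wrapped in a loop. The sketch below shows the general pattern with a made-up, in-memory stand-in for the API (`fake_search` and `FAKE_PAGES` are hypothetical), so it runs without credentials. With a real client the loop body would call `client.search_recent_tweets` instead.

```python
# Stand-in for an API that returns results one page at a time.
FAKE_PAGES = {
    None: {"data": ["tweet 1", "tweet 2"], "meta": {"next_token": "page2"}},
    "page2": {"data": ["tweet 3", "tweet 4"], "meta": {"next_token": "page3"}},
    "page3": {"data": ["tweet 5"], "meta": {}},  # last page: no next_token
}


def fake_search(query, next_token=None):
    """Hypothetical replacement for `client.search_recent_tweets`."""
    return FAKE_PAGES[next_token]


def collect_all(query, max_pages=10):
    """Follow `next_token` until it disappears or `max_pages` is reached."""
    results, token = [], None
    for _ in range(max_pages):
        page = fake_search(query, next_token=token)
        results.extend(page["data"])
        token = page["meta"].get("next_token")
        if token is None:
            break
    return results


all_tweets = collect_all("Qatar")
print(len(all_tweets))
```

The `max_pages` cap is a deliberate safety net: real APIs have rate limits, so an unbounded loop is rarely a good idea. tweepy also provides a `Paginator` helper for this purpose; check its documentation before relying on it.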

## What are people saying about LSE on Twitter?

In [4]:
tweet_fields=["id", "text", "attachments", "author_id", "context_annotations", "conversation_id", 
              "created_at", "entities", "in_reply_to_user_id", "lang", "public_metrics", "geo"]

In [23]:
# Assumption: search for the exact phrase, since this section is about LSE
lse_tweets = client.search_recent_tweets(query='"London School of Economics"', 
                                         tweet_fields=tweet_fields,
                                         user_fields="location",
                                         max_results=100)

In [24]:
tweet = lse_tweets.data[0]
tweet.geo

In [31]:
tweet.entities

{'mentions': [{'start': 3,
   'end': 16,
   'username': 'HoyPalestina',
   'id': '1258174836175704064'}],
 'urls': [{'start': 96,
   'end': 119,
   'url': 'https://t.co/UyULSbvoFs',
   'expanded_url': 'https://twitter.com/HoyPalestina/status/1599761913465880577/video/1',
   'display_url': 'pic.twitter.com/UyULSbvoFs',
   'media_key': '7_1599761850454900739'}]}

In [30]:
dir(tweet)

['__abstractmethods__',
 '__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '_abc_impl',
 'attachments',
 'author_id',
 'context_annotations',
 'conversation_id',
 'created_at',
 'data',
 'edit_controls',
 'edit_history_tweet_ids',
 'entities',
 'geo',
 'get',
 'id',
 'in_reply_to_user_id',
 'items',
 'keys',
 'lang',
 'non_public_metrics',
 'organic_metrics',
 'possibly_sensitive',
 'promoted_metrics',
 'public_metrics',
 'referenced_tweets',
 'reply_settings',
 'source',
 'text',
 'values',
 'withheld']

**Who tweeted that?**

In [25]:
[tweet.geo for tweet in lse_tweets.data]

[None,
 None,
 None,
 ...
 None]

In [18]:
tweet.author_id

311244158

In [19]:
client.get_user(id=lse_tweets.data[0].author_id)

Response(data=<User id=311244158 name=Eliana Jung Duran username=e_jungd>, includes={}, errors=[], meta={})

# Best Practices 

In [None]:
#### DATA COLLECTION ####
import pandas as pd

# 1. Create a CSV file (here: politicians.csv, with a "username" column)

# 2. Read it
df_politicians = pd.read_csv("politicians.csv")

# 3. Create a function
def get_all_tweets(username):
    # Suggestion: collect everything you can from the tweets, not just the text.
    # Things you can consider collecting:
    #   ["id", "text", "attachments", "author_id", "context_annotations", "conversation_id",
    #    "created_at", "entities", "in_reply_to_user_id", "lang", "public_metrics", "geo"]
    df_all_tweets = ...  # TODO: your collection code goes here
    return df_all_tweets  # for a particular user

# 4. Collect all information
df_all_tweets = pd.concat([get_all_tweets(username) for username in df_politicians["username"]])

# 5. Save it as a CSV file


##### PRE-PROCESSING OR ANALYSIS PART #####

# ...in a separate file...
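
A minimal sketch of steps 3 to 5 above, using only the standard library so that it runs without API credentials. The tweet records are made-up dictionaries standing in for what the API returns, and `tweets_to_csv` is a hypothetical helper. In the real workflow the rows would come from `get_all_tweets`, and you would write to a file on disk rather than to a string.

```python
import csv
import io

# Made-up records standing in for tweets collected from the API.
fake_tweets = [
    {"id": 1, "author_id": 10, "lang": "en", "text": "Hello, LSE!"},
    {"id": 2, "author_id": 11, "lang": "es", "text": "Hola\nmundo"},
]

FIELDS = ["id", "author_id", "lang", "text"]


def tweets_to_csv(tweets, fields=FIELDS):
    """Write one row per tweet, keeping only `fields`, and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(tweets)
    return buffer.getvalue()


csv_text = tweets_to_csv(fake_tweets)

# Reading it back gives the same records (as strings), commas and newlines included.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[1]["text"])
```

Saving the raw collection first means the pre-processing notebook can be re-run as often as needed without calling the API again.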


# Let's create a `tidy` dataframe

In [20]:
import pandas as pd

In [25]:
{field: tweet[field] for field in ["author_id", "text", "created_at", "lang"]}

{'author_id': 1541466817633419264,
 'text': "Gy√∂rgy left Hungary in 1947 and studied economics at the London School of Economics, where he received a bachelor's degree in 1951 and a PhD in 1954.",
 'created_at': datetime.datetime(2022, 11, 28, 21, 40, 31, tzinfo=datetime.timezone.utc),
 'lang': 'en'}

In [27]:
df = pd.concat([pd.DataFrame({field: tweet[field] 
                              for field in ["author_id", "text", "created_at", "lang"]}, index=[tweet["id"]]) 
                for tweet in lse_tweets.data])

In [74]:
df

| | author_id | text | created_at | lang |
|---|---|---|---|---|
| 1597574950415638528 | 63684604 | Director of International Programmes and Impac... | 2022-11-29 12:55:24+00:00 | en |
| 1597574643489312769 | 236346971 | RT @fascinatorfun: “That amounts to a loss of ... | 2022-11-29 12:54:11+00:00 | en |
| 1597573860890931200 | 442976787 | The London School of Economics has estimated t... | 2022-11-29 12:51:04+00:00 | en |
| 1597572812923088897 | 276442577 | RT @fascinatorfun: “That amounts to a loss of ... | 2022-11-29 12:46:54+00:00 | en |
| 1597571481667768320 | 188690239 | RT @lag_uk: 📢We are pleased to award the 2022 ... | 2022-11-29 12:41:37+00:00 | en |
| ... | ... | ... | ... | ... |
| 1597321824215650304 | 1159888310891945985 | @philerator @phlannelphysics @bleasdale_r @The... | 2022-11-28 20:09:34+00:00 | en |
| 1597318449336037376 | 4328654057 | [Κλικ και διαβάστε στο ΠΡΩΤΟ ΘΕΜΑ]Δείτε live: ... | 2022-11-28 19:56:09+00:00 | el |
| 1597317637117464577 | 1442770309694754821 | Στην εκδήλωση του Ελληνικού Παρατηρητηρίου του... | 2022-11-28 19:52:56+00:00 | el |
| 1597317226960502784 | 4328654057 | [Κλικ και διαβάστε στο ΠΡΩΤΟ ΘΕΜΑ]Δείτε live: ... | 2022-11-28 19:51:18+00:00 | el |


**Let's add each author's username!**

There is a simpler way to do this, but I want to use it to demonstrate the concept of `merge`.

In [28]:
df["author_id"].nunique()

98

In [29]:
all_authors = df["author_id"].unique()
all_authors

array([          311244158,            74883793,  930217579800604672,
       1497870550332579845,           940224440,            77174319,
                  48548972,  843245260088229890,           211221275,
        852575123999780864,          2425846856,  739730879770198017,
                 577431005, 1087467579751452672, 1554576805373296640,
       1061233680570556418,          1489899626,          1911206232,
       1290481714393780228,            39450170, 1007404152278929410,
       1293301271890460672, 1062397828650164227,          3030504268,
                 717983004,           322661351,          3131500904,
        874613730985861121, 1496217735948345345, 1382409342998229004,
       1029382765664501762, 1261552344724049920,          2753943800,
                  63684604,           236346971,           442976787,
                 276442577,           188690239, 1496957719982579712,
                 283549521,           578444253,           283604227,
                 105

**Use `list comprehension` to obtain author_usernames**

In [37]:
client.get_user(id=38631911).data["username"]

'B3CPres'

In [161]:
[client.get_user(id=author_id) for author_id in all_authors]

[Response(data=<User id=63684604 name=Guardian Exec Jobs username=GJ_Exec>, includes={}, errors=[], meta={}),
 Response(data=<User id=236346971 name=Bob Massam #FBPE🔶💙 username=bmassam>, includes={}, errors=[], meta={}),
 Response(data=<User id=442976787 name=Ros Chappell 🇺🇦StandWithUkraine ⭐️UK Rejoin 🇪🇺 username=RosChappell>, includes={}, errors=[], meta={}),
 Response(data=<User id=276442577 name=12Pat username=StarterPat>, includes={}, errors=[], meta={}),
 Response(data=<User id=188690239 name=Sam Halvorsen username=samhalvorsen>, includes={}, errors=[], meta={}),
 Response(data=<User id=1496957719982579712 name=M P username=mexxez16>, includes={}, errors=[], meta={}),
 Response(data=<User id=283549521 name=Matthew Aaron Richmond username=mattyrichy>, includes={}, errors=[], meta={}),
 Response(data=<User id=578444253 name=Dominique username=dominiquelevack>, includes={}, errors=[], meta={}),
 Response(data=<User id=283604227 name=Andy Vermaut username=AndyVe
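
Making one request per author is slow. A common alternative is to ask for several users in a single request. The helper below is a hypothetical, standard-library sketch of the batching step only: it splits the ids into chunks of at most 100. tweepy's `Client.get_users(ids=...)` accepts a list of ids, but check the current API limits in the documentation before relying on the chunk size assumed here.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


author_ids = list(range(1, 251))  # made-up ids, for illustration only

batches = list(chunked(author_ids, size=100))
print([len(batch) for batch in batches])

# With a real client, each batch could then be sent in a single request, e.g.:
# for batch in batches:
#     response = client.get_users(ids=batch)
```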

**That took quite some time... How do we check whether the code is stuck?**

💡 Use a library called tqdm to show a progress bar:

In [38]:
import tqdm

author_usernames = [client.get_user(id=author_id).data["username"] for author_id in tqdm.tqdm(all_authors)]



  0%|                                                       | 0/98 [00:00<?, ?it/s]
  1%|▌                                                      | 1/98 [00:00<00:18,  5.33it/s]
  2%|█                                                      | 2/98 [00:00<00:17,  5.40it/s]
  3%|█▋                                                     | 3/98 [00:00<00:18,  5.06it/s]
  4%|██▏                                                    | 4/98 [00:00<00:18,  4.98it/s]
  5%|██▊                                                    | 5/98 [00:00<00:18,  4.93it/s]
  6%|███▎                                                   | 6/98 [00:01<00:18,  4.95it/s]
  7%|███▉                                                   | 7/98 [00:01<00:18,  4.92it/s]
  8%|████▍                                                  | 8/98 [00:01<00:18,  4.91it/s]
  9%|█████                                         

**OK, we ran the same requests as before, but this time we saved the usernames. Let's put them in a data frame:**

In [39]:
df_authors = pd.DataFrame({"author_id": all_authors,
                           "author_username": author_usernames})
df_authors

| | author_id | author_username |
|---|---|---|
| 0 | 311244158 | e_jungd |
| 1 | 74883793 | GonzaloBarria |
| 2 | 930217579800604672 | KdXI8GYHGr8fz1s |
| 3 | 1497870550332579845 | LinaMpoumpou |
| 4 | 940224440 | frank29273107 |
| ... | ... | ... |
| 93 | 2579454504 | xpgomes3 |
| 94 | 1113426385974984705 | LindaMulcahy7 |
| 95 | 1527074713 | AntennaNews |
| 96 | 38631911 | B3CPres |


In [40]:
df = pd.merge(df, df_authors, how="left", on=["author_id"])
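
To see what `how="left"` does, here is the same idea written in plain Python with made-up rows: every row on the left is kept, and the matching username is looked up by `author_id`. Rows without a match get `None` (pandas would show `NaN`). The data and the helper `left_merge` are invented for this illustration and do not reflect how pandas implements `merge`.

```python
tweets = [
    {"author_id": 1, "text": "first tweet"},
    {"author_id": 2, "text": "second tweet"},
    {"author_id": 1, "text": "third tweet"},
    {"author_id": 3, "text": "fourth tweet"},  # no matching author below
]
authors = [
    {"author_id": 1, "author_username": "alice"},
    {"author_id": 2, "author_username": "bob"},
]


def left_merge(left, right, on):
    """Keep every row of `left`; add the columns of the matching `right` row.

    Simplification: assumes each key appears at most once in `right`.
    """
    lookup = {row[on]: row for row in right}
    extra_columns = [key for key in right[0] if key != on]
    merged = []
    for row in left:
        match = lookup.get(row[on], {})
        merged.append({**row, **{col: match.get(col) for col in extra_columns}})
    return merged


merged_rows = left_merge(tweets, authors, on="author_id")
for row in merged_rows:
    print(row)
```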

In [41]:
df

| | author_id | text | created_at | lang | author_username |
|---|---|---|---|---|---|
| 0 | 311244158 | RT @icp_uc: ¡Que orgullo!👏💪Yancy Villarroel, e... | 2022-11-29 16:28:49+00:00 | es | e_jungd |
| 1 | 74883793 | RT @icp_uc: ¡Que orgullo!👏💪Yancy Villarroel, e... | 2022-11-29 16:25:56+00:00 | es | GonzaloBarria |
| 2 | 930217579800604672 | RT @evangelia_re: Πήγε ο Μητσοτάκης στο london... | 2022-11-29 16:19:00+00:00 | el | KdXI8GYHGr8fz1s |
| 3 | 1497870550332579845 | RT @evangelia_re: Πήγε ο Μητσοτάκης στο london... | 2022-11-29 16:18:04+00:00 | el | LinaMpoumpou |
| 4 | 940224440 | RT @evangelia_re: Πήγε ο Μητσοτάκης στο london... | 2022-11-29 16:01:49+00:00 | el | frank29273107 |
| ... | ... | ... | ... | ... | ... |
| 95 | 1527074713 | Μητσοτάκης στο London School of Economics: Εφι... | 2022-11-28 22:01:55+00:00 | el | AntennaNews |
| 96 | 1581543998090579968 | RT @NikosKoudounis1: @kostasp1488 Ρελάνς του Τ... | 2022-11-28 22:00:52+00:00 | el | NikosKoudounis1 |
| 97 | 1581543998090579968 | RT @NikosKoudounis1: @chris_avramidis @theodor... | 2022-11-28 22:00:41+00:00 | el | NikosKoudounis1 |
| 98 | 38631911 | RT @BrexitBin: In case you missed it ...\nHere... | 2022-11-28 21:46:12+00:00 | en | B3CPres |


# Extract tokens

Tokenisation is the process of segmenting text into words, punctuation marks, etc.

We are going to use spaCy for this.

## Which languages are involved?

In [42]:
df["lang"].value_counts()

en    49
el    43
es     5
in     2
de     1
Name: lang, dtype: int64

Check the Twitter documentation for the list of language codes: https://developer.twitter.com/en/docs/twitter-for-websites/supported-languages

## Let's focus on language='en' first

In [43]:
# Pick one random sample
sample = df.query("lang == 'en'").sample(1)

sample

| | author_id | text | created_at | lang | author_username |
|---|---|---|---|---|---|
| 57 | 1586632345129742336 | he waged\na relentless, lifelong struggle for ... | 2022-11-29 10:05:46+00:00 | en | MadrasTribes |


In [45]:
just_the_text = sample["text"].values[0]
print(just_the_text)

he waged
a relentless, lifelong struggle for the rights of Dalits. Having
earned doctorates from both Columbia University and the
London School of Economics, went on to serve as Chairman
of the Drafting Committee of the Indian Constitution.

(2)


🗣️ **CLASSROOM DISCUSSION:** What does the following represent?

In [46]:
len(just_the_text)

245

In [47]:
type(just_the_text)

str

### Tokenization

**We need to load the language-specific features from spaCy**

In [48]:
from spacy.lang.en import English
language_parser = English()

tokenized_text = language_parser(just_the_text)
type(tokenized_text)


spacy.tokens.doc.Doc

In [49]:
tokenized_text

he waged
a relentless, lifelong struggle for the rights of Dalits. Having
earned doctorates from both Columbia University and the
London School of Economics, went on to serve as Chairman
of the Drafting Committee of the Indian Constitution.

(2)

🗣️ **CLASSROOM DISCUSSION:** What do you think the following represents?

In [50]:
len(tokenized_text)

50

In [222]:
for token in tokenized_text:
    print(token)

RT
@BrexitBin
:
In
case
you
missed
it
...


Here
's
an
excerpt
from
a
report
into
immigration
and
wages
by
the
London
School
of
Economics
.
It
s
…


**Silly way to count repeated tokens using `list comprehension`**

In [51]:
pd.Series([token for token in tokenized_text]).value_counts()

he              1
of              1
of              1
Economics       1
,               1
went            1
on              1
to              1
serve           1
as              1
Chairman        1
\n              1
the             1
waged           1
Drafting        1
Committee       1
of              1
the             1
Indian          1
Constitution    1
.               1
\n\n            1
(               1
2               1
School          1
London          1
\n              1
the             1
\n              1
a               1
relentless      1
,               1
lifelong        1
struggle        1
for             1
the             1
rights          1
of              1
Dalits          1
.               1
Having          1
\n              1
earned          1
doctorates      1
from            1
both            1
Columbia        1
University      1
and             1
)               1
dtype: int64
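
Every count above is 1, even for tokens such as `of` and `the` that clearly repeat. A likely explanation is that the list holds spaCy `Token` objects, which are treated as distinct items even when their text is the same. Counting the token text (for example `token.text`) avoids this. The sketch below shows the idea with the standard library only; the whitespace split is a stand-in for the spaCy tokenizer and the sentence is made up.

```python
from collections import Counter

text = "the rights of dalits the chairman of the committee of the constitution"

# Stand-in for `[token.text for token in tokenized_text]`.
token_texts = text.split()

counts = Counter(token_texts)
print(counts.most_common(3))
```

The same applies to the pandas version: building the `Series` from strings instead of `Token` objects lets `value_counts()` group repeated words together.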

**Let's change everything to lowercase:**

In [224]:
new_tokenized_text = language_parser(just_the_text.strip().lower())

pd.Series([token for token in new_tokenized_text]).value_counts()

rt             1
report         1
s              1
it             1
.              1
economics      1
of             1
school         1
london         1
the            1
by             1
wages          1
and            1
immigration    1
into           1
a              1
@brexitbin     1
from           1
excerpt        1
an             1
's             1
here           1
\n             1
...            1
it             1
missed         1
you            1
case           1
in             1
:              1
…              1
dtype: int64

### Lemmatization

https://spacy.io/usage/linguistic-features#lemmatization

There are **pre-trained** fancy NLP models that can detect more interesting things about our text.

In [235]:
# You need to download the suitable NLP models https://spacy.io/models/en
#!python -m spacy download en_core_web_sm

In [53]:
import spacy

nlp = spacy.load("en_core_web_sm")

In [57]:
fancier_tokenized_text = nlp(just_the_text.strip().lower())

fancier_tokenized_text[1].lemma_

'wage'

In [63]:
my_tokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in fancier_tokenized_text ]

### Remove punctuation & stop words

In [64]:
import string

from spacy.lang.en.stop_words import STOP_WORDS

# Create our list of punctuation marks
punctuations = string.punctuation

# Stop words for the English language (already imported above)
stop_words = STOP_WORDS

# Removing stop words
my_tokens = [ word for word in my_tokens if word not in stop_words and word not in punctuations ]
my_tokens

['wage',
 'relentless',
 'lifelong',
 'struggle',
 'right',
 'dalit',
 'having',
 'earn',
 'doctorate',
 'columbia',
 'university',
 'london',
 'school',
 'economic',
 'serve',
 'chairman',
 'drafting',
 'committee',
 'indian',
 'constitution',
 '2']

In [243]:
len(my_tokens)

21
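
One detail worth knowing: `string.punctuation` is a single string, so `word not in punctuations` is a substring test. A token such as `...` is not a substring of it and would therefore slip through the filter. The sketch below shows a stricter check; `is_punctuation` and the tiny stop-word set are made up for this example and are not part of spaCy.

```python
import string

tokens = ["wage", ",", "the", "struggle", "...", "", "rights", "."]

# A tiny made-up stop-word list; spaCy's STOP_WORDS is much longer.
stop_words = {"the", "of", "a"}


def is_punctuation(token):
    """True if the token is empty or made only of punctuation characters."""
    return all(char in string.punctuation for char in token)


# Filter used in this notebook: substring test against the punctuation string.
substring_filter = [t for t in tokens if t not in stop_words and t not in string.punctuation]

# Stricter alternative: test every character of the token.
stricter_filter = [t for t in tokens if t not in stop_words and not is_punctuation(t)]

print(substring_filter)
print(stricter_filter)
```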

## Putting everything together: automating this process

Also, check [this tutorial](https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/)

In [66]:
import string

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Create our list of punctuation marks
punctuations = string.punctuation

# Load the pre-trained English model and the list of stop words
nlp = spacy.load("en_core_web_sm")
stop_words = STOP_WORDS

def clean_text(tweet_text):
    """Lowercase, lemmatise and filter a tweet; returns a list of tokens."""
    simpler_text = tweet_text.strip().lower()
    mytokens = nlp(simpler_text)
    # Use the lemma of each token (falling back to the lowercase text for pronouns)
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    # Drop stop words and punctuation marks
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]
    return mytokens

Use pandas `apply` to make your code look cleaner!

In [70]:
df_en = df.query("lang == 'en'")

In [71]:
df_en["text"].apply(clean_text)

7     [rt, @pleaseletmevote, london, school, economi...
17    [director, international, programme, impact, l...
18    [tax, cut, wealthy, long, draw, support, conse...
23    [rt, @lag_uk, 📢, pleased, award, 2022, latin, ...
25    [rt, @fascinatorfun, loss, 20, pound, value, d...
32    [rt, @tanzania_epza, y'day, courtesy, visit, c...
33    [director, international, programme, impact, l...
34    [rt, @fascinatorfun, loss, 20, pound, value, d...
35    [london, school, economic, estimate, brexit, –...
36    [rt, @fascinatorfun, loss, 20, pound, value, d...
37    [rt, @lag_uk, 📢, pleased, award, 2022, latin, ...
39    [rt, @lag_uk, 📢, pleased, award, 2022, latin, ...
40    [london, school, economic, estimate, brexit, ....
41    [andy, vermaut, share, finance, employee, futu...
42    [rt, @lag_uk, 📢, pleased, award, 2022, latin, ...
43    [rt, @fascinatorfun, loss, 20, pound, value, d...
44    [rt, @fascinatorfun, loss, 20, pound, value, d...
45    [📢, pleased, award, 2022,

**There is a fancy tqdm for pandas**

In [72]:
from tqdm import tqdm
# from tqdm.auto import tqdm  # for notebooks

# Create new `pandas` methods which use `tqdm` progress
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()

In [73]:
df_en["text"].progress_apply(clean_text)

100%|██████████████████████████████████████████████████████| 49/49 [00:01<00:00, 39.23it/s]


7     [rt, @pleaseletmevote, london, school, economi...
17    [director, international, programme, impact, l...
18    [tax, cut, wealthy, long, draw, support, conse...
23    [rt, @lag_uk, 📢, pleased, award, 2022, latin, ...
25    [rt, @fascinatorfun, loss, 20, pound, value, d...
32    [rt, @tanzania_epza, y'day, courtesy, visit, c...
33    [director, international, programme, impact, l...
34    [rt, @fascinatorfun, loss, 20, pound, value, d...
35    [london, school, economic, estimate, brexit, –...
36    [rt, @fascinatorfun, loss, 20, pound, value, d...
37    [rt, @lag_uk, 📢, pleased, award, 2022, latin, ...
39    [rt, @lag_uk, 📢, pleased, award, 2022, latin, ...
40    [london, school, economic, estimate, brexit, ....
41    [andy, vermaut, share, finance, employee, futu...
42    [rt, @lag_uk, 📢, pleased, award, 2022, latin, ...
43    [rt, @fascinatorfun, loss, 20, pound, value, d...
44    [rt, @fascinatorfun, loss, 20, pound, value, d...
45    [📢, pleased, award, 2022,

# Now what?

Now that you have cleaned and tokenized everything, you can do the fun stuff:

## Bag of Words 

Use `CountVectorizer` from the [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) package to create a bag of words.

In [74]:
from sklearn.feature_extraction.text import CountVectorizer

bow_vector = CountVectorizer(tokenizer = clean_text, ngram_range=(1,1))

In [75]:
bow_vector.fit_transform(df.query("lang == 'en'")["text"])

<49x321 sparse matrix of type '<class 'numpy.int64'>'
	with 806 stored elements in Compressed Sparse Row format>

In [76]:
bow_vector.fit_transform(df.query("lang == 'en'")["text"]).todense().shape

(49, 321)
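
To make the shape `(49, 321)` concrete: a bag of words has one row per document and one column per distinct token, and each cell counts how often that token occurs in that document. The sketch below builds such a matrix by hand for three made-up token lists. It illustrates the idea only and is not how `CountVectorizer` is implemented.

```python
from collections import Counter

# Made-up, already-cleaned documents (one list of tokens per tweet).
documents = [
    ["london", "school", "economic"],
    ["brexit", "loss", "london", "london"],
    ["school", "brexit"],
]

# One column per distinct token, in alphabetical order.
vocabulary = sorted({token for doc in documents for token in doc})

# One row per document; each cell is the count of that token in that document.
matrix = []
for doc in documents:
    counts = Counter(doc)
    matrix.append([counts[token] for token in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Most cells are zero because each tweet uses only a handful of the tokens in the vocabulary, which is why scikit-learn returns a sparse matrix.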

In [259]:
#bow_vector.get_feature_names_out()

In [77]:
df_bag_words = pd.DataFrame(bow_vector.fit_transform(df.query("lang == 'en'")["text"]).todense(),
                            columns=bow_vector.get_feature_names_out())

In [78]:
df_bag_words

Unnamed: 0,..,...,.....,01,08:46,1/3,16,1947,1951,1954,....1,workin,world,wraptite,y'day,year,‚Äì,‚Äî,‚Ä¶,üåç,üì¢
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,2,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
8,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


**What are the most frequent tokens?**

In [79]:
df_bag_words.sum().sort_values(ascending=False).head(10)

london      45
school      44
economic    39
…           28
rt          28
20          12
brexit      12
loss        12
value       12
pound       12
dtype: int64

🗣️ **CLASSROOM DISCUSSION:** What would you do with this data now?

### PCA + plotly

In [268]:
!pip install plotly==5.11.0


In [271]:
## Train an algorithm called PCA

## Read more about it here https://scikit-learn.org/stable/modules/decomposition.html#pca

from sklearn.decomposition import PCA

pca = PCA()
components = pca.fit_transform(df_bag_words)

In [304]:
df_pca = pd.DataFrame(components, columns=[f"PC{i+1}" for i in range(components.shape[1])])
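
For reference, a brief sketch of what the PCA step computes, in standard notation. The symbols are introduced here and do not appear elsewhere in this notebook. Let $X$ be the $n \times p$ bag-of-words matrix (here $n = 49$ tweets and $p = 321$ tokens), $\mu$ the vector of column means, and $\bar{X}$ the column-centred matrix:

$$
\bar{X} = X - \mathbf{1}\mu^\top, \qquad \bar{X} = U \Sigma V^\top, \qquad Z = \bar{X} V = U \Sigma
$$

Each column of $Z$ holds the scores for one principal component (`PC1`, `PC2`, ...), and the share of variance captured by component $k$ is $\sigma_k^2 / \sum_j \sigma_j^2$, where $\sigma_k$ is the $k$-th singular value on the diagonal of $\Sigma$. With the default `PCA()` the number of components is at most $\min(n, p)$, so `df_pca` should have at most 49 columns.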