# Data Cleaning on tweets about Messi and Ronaldo, by Ibrahim SEROUIS 💻

## What is data cleaning? 🧼

"Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset." [read more](https://www.tableau.com/learn/articles/what-is-data-cleaning)

## What to expect 🤔

In this notebook, we clean [these tweets](https://www.kaggle.com/datasets/ibrahimserouis99/twitter-sentiment-analysis-and-word-embeddings): we handle null values, drop duplicate tweets, and strip mentions, links, special characters, and numbers from the tweet text.

# Libraries

In [1]:
!pip install --user clean-text

Collecting clean-text
  Downloading clean_text-0.6.0-py3-none-any.whl (11 kB)
Collecting ftfy<7.0,>=6.0
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
Installing collected packages: ftfy, clean-text
Successfully installed clean-text-0.6.0 ftfy-6.1.1


In [2]:
import re
import pandas as pd
from cleantext import clean

# Files

In [3]:
dataset_messi = pd.read_csv("/kaggle/input/twitter-sentiment-analysis-and-word-embeddings/messi_tweets.csv", encoding="utf-8")
dataset_ronaldo = pd.read_csv("/kaggle/input/twitter-sentiment-analysis-and-word-embeddings/ronaldo_tweets.csv", encoding="utf-8")
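
When a column contains missing or mixed values, pandas has to guess its type while reading the file. The sketch below shows one way to state the types explicitly. It reads a small in-memory CSV with made-up rows instead of the Kaggle files, and the choice of the `string` and nullable `Int64` types is an assumption for illustration, not something the original notebook does.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for one of the CSV files (made-up rows).
sample_csv = io.StringIO(
    "tweet_id,author_id,content,retweet_count\n"
    "1516885082958663680,311410036,hello,0\n"
    "1516885060074582018,968820590,world,\n"
)

# Read identifiers as strings so they are never converted to floats,
# and let the count column hold missing values with a nullable integer type.
sample = pd.read_csv(
    sample_csv,
    dtype={"tweet_id": "string", "author_id": "string", "retweet_count": "Int64"},
)

print(sample.dtypes)
```

With explicit types, a missing count stays a missing value and the remaining counts are not turned into floats.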



# Display some samples

## Messi

In [4]:
dataset_messi.head(5)

|  | tweet_id | author_id | content | lang | date | source | geo | retweet_count | like_count | quote_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1516884143518461955 | 1407652046677987328 | this saturday lionel andrés messi will win the... | en | 2022-04-20T20:58:56.000Z | Twitter for Android | NaN | 0 | 0 | 0 |
| 1 | 1516884135561908224 | 1268536714177531905 | @goal @PSG_English @PSG_inside WOW great for h... | en | 2022-04-20T20:58:54.000Z | Twitter for Android | NaN | 0 | 0 | 0 |
| 2 | 1516884129614372864 | 1303984698461507585 | @kokipower23 @moharmcf_ @LaPulgafan10 @TheEuro... | en | 2022-04-20T20:58:53.000Z | Twitter for Android | NaN | 0 | 0 | 0 |
| 3 | 1516884128955777024 | 1245698960343302146 | @PSG_English Messi was holding this team back. | en | 2022-04-20T20:58:53.000Z | Twitter for Android | NaN | 0 | 0 | 0 |
| 4 | 1516884112811962368 | 1432532648472219650 | @Nate7z Today Benzema Is better player. In car... | en | 2022-04-20T20:58:49.000Z | Twitter Web App | NaN | 0 | 0 | 0 |


## Ronaldo

In [5]:
dataset_ronaldo.head(5)

|  | tweet_id | author_id | content | lang | date | source | geo | retweet_count | like_count | quote_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1516885082958663680 | 311410036 | @talkSPORT He’s channeling his inner Ronaldo | en | 2022-04-20T21:02:40.000Z | Twitter for iPhone | NaN | 0.0 | 0.0 | 0.0 |
| 1 | 1516885060074582018 | 968820590 | Is Ronaldo back on Saturday? | en | 2022-04-20T21:02:35.000Z | Twitter for iPhone | NaN | 0.0 | 0.0 | 0.0 |
| 2 | 1516885051337805824 | 3007018857 | @JailedGed 7 goals all season. Ev fans class h... | en | 2022-04-20T21:02:33.000Z | Twitter for iPhone | NaN | 0.0 | 1.0 | 0.0 |
| 3 | 1516885050578837506 | 1516158593791565825 | @PSGhub @canalplus Ronaldo's baby boy 😭😭😭 http... | en | 2022-04-20T21:02:33.000Z | Twitter for Android | NaN | 0.0 | 0.0 | 0.0 |
| 4 | 1516885041250504704 | 4756096577 | Y’all said he will outscore Ronaldo 😭🤲 https:/... | en | 2022-04-20T21:02:30.000Z | Twitter for iPhone | NaN | 0.0 | 0.0 | 0.0 |


# Data cleaning 🧼🧽

## Messi

### Check for null values

In [6]:
print(f"Null values per column : \n\n{dataset_messi.isna().sum()}")

Null values per column : 

tweet_id              0
author_id             0
content               0
lang                  0
date                  0
source                0
geo              158969
retweet_count         0
like_count            0
quote_count           0
dtype: int64


### Drop duplicates
Duplicates are identified by tweet ID; the first occurrence of each tweet is kept.

In [7]:
print(f"Row count before deletion: {len(dataset_messi.index)} ")
dataset_messi = dataset_messi.drop_duplicates(subset="tweet_id", keep="first")
print(f"Row count after deletion: {len(dataset_messi.index)}")

Row count before deletion: 160195 
Row count after deletion: 123634
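
The counts above mean that 160195 - 123634 = 36561 rows (about 23%) were repeats of a tweet already present. As a minimal sketch of how `drop_duplicates` behaves with `subset` and `keep="first"`, here is the same call on a made-up four-row frame; the toy data and the variable names `toy`, `is_repeat` and `deduplicated` are illustrative only.

```python
import pandas as pd

# Made-up rows: the tweet with id 2 appears twice.
toy = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "content": ["a", "b", "b (collected again)", "c"],
})

# Flag every repeat of an already-seen tweet_id (the first occurrence is not flagged).
is_repeat = toy.duplicated(subset="tweet_id", keep="first")
print(f"Duplicated rows: {is_repeat.sum()}")

# Same call as in the notebook: keep only the first occurrence of each id.
deduplicated = toy.drop_duplicates(subset="tweet_id", keep="first")
print(deduplicated)
```

Counting the repeats with `duplicated` first is a cheap way to know how many rows will disappear before actually dropping them.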


### Percentage of null geo values

In [8]:
count = len(dataset_messi.index)
print(f"Number of values : {count}")

Number of values : 123634


In [9]:
count_null = dataset_messi.isna().sum()["geo"]
percentage = count_null*100/count
print(f"Percentage of null geo values {round(percentage,2)}%")

Percentage of null geo values 99.21%
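
The same percentage can be obtained in one step, because the mean of a boolean column is the share of `True` values. The sketch below uses a made-up `geo` column with three missing values out of four; it is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Made-up column with three missing values out of four.
toy = pd.DataFrame({"geo": [None, "01aadce76841e2c5", None, None]})

# isna() gives True/False per row; the mean of booleans is the share of True values.
share_null = toy["geo"].isna().mean()
print(f"Percentage of null geo values: {share_null * 100:.2f}%")
```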


### Assign the -1 id to null geo values

In [10]:
dataset_messi.geo = dataset_messi.geo.apply(lambda x: -1 if pd.isna(x) else x)

#### Display results

In [11]:
dataset_messi["geo"].head(5)

0    -1
1    -1
2    -1
3    -1
4    -1
Name: geo, dtype: object

In [12]:
print(f"Some locations ids: {dataset_messi['geo'].unique()[0:5]}")

Some locations ids: [-1 '01aadce76841e2c5' '6565298bcadb82a1' 'cee903c102af3a7f'
 'bb94af3e1fdbeb7f']
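
Replacing missing values with a sentinel is what `fillna` is designed for, so the `apply` call above could probably be written more directly. A minimal sketch on a made-up column follows; the explicit `object` dtype mirrors the dtype shown in the output above, and the toy values are illustrative.

```python
import pandas as pd

# Made-up geo column with the same object dtype as in the notebook output.
toy = pd.DataFrame({"geo": pd.Series([None, "01aadce76841e2c5", None], dtype="object")})

# fillna replaces only the missing entries and leaves real ids untouched.
toy["geo"] = toy["geo"].fillna(-1)

print(toy["geo"].tolist())
```

Note that the column ends up mixing the integer `-1` with string ids, exactly as in the output above, so any later comparison has to use `-1` and not `"-1"`.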


### Clean tweets: helper function and regular expressions

#### Utility function

In [13]:
def clean_tweet(text):
    """
    Lowercase the text, remove punctuation and emojis, and normalize whitespace.
    """
    
    text = clean(text,
                 no_punct=True,
                 lower=True,
                 no_emoji=True,
                 normalize_whitespace=True
                )
    
    return text

#### Create the text cleaning regular expressions 

In [14]:
# Pattern for mentions (e.g. @PSG_English)
regex_mentions = r"@[A-Za-z0-9_]+"
# Pattern for links
regex_links = r"https?://[A-Za-z0-9./]+"
# Pattern for runs of characters other than ASCII letters and digits
regex_special = r"[^A-Za-z0-9]+"
# Pattern for numbers
regex_numbers = r"[0-9]+"
# Pattern for ordinals (e.g. 1st, 2 nd)
regex_ordinals = r"[0-9]+(?:st| st|nd| nd|rd| rd|th| th)"
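
To see what these patterns actually match, the sketch below runs three of them on a made-up tweet. It also illustrates a possible pitfall: the ordinal pattern has no word boundaries, so a number followed by a word starting with "st", "nd", "rd" or "th" is matched too. The stricter `regex_ordinals_strict` is a hypothetical variant, not part of the original notebook.

```python
import re

# Patterns copied from the cell above.
regex_mentions = r"@[A-Za-z0-9_]+"
regex_links = r"https?://[A-Za-z0-9./]+"
regex_ordinals = r"[0-9]+(?:st| st|nd| nd|rd| rd|th| th)"

# Made-up tweet used only for illustration.
tweet = "@PSG_English Messi scored his 3rd goal https://t.co/abc123"

print(re.findall(regex_mentions, tweet))
print(re.findall(regex_links, tweet))
print(re.findall(regex_ordinals, tweet))

# Pitfall: without word boundaries, "2 strikers" is matched as "2 st".
loose_result = re.sub(regex_ordinals, " ", "he beat 2 strikers")
print(loose_result)

# Hypothetical stricter variant (not in the original notebook):
# require a word boundary on both sides of the ordinal.
regex_ordinals_strict = r"\b[0-9]+ ?(?:st|nd|rd|th)\b"
strict_result = re.sub(regex_ordinals_strict, " ", "he beat 2 strikers")
print(strict_result)
print(re.sub(regex_ordinals_strict, " ", "his 3rd goal"))
```

Whether the stricter form is preferable depends on how often tweets write ordinals with a space; it is shown here only to make the trade-off visible.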

### Clean tweets: remove mentions, links, special characters, extra spaces...

In [15]:
# Remove mentions
dataset_messi.content = dataset_messi.content.apply(lambda x: re.sub(regex_mentions, " ", str(x).strip()))
# Remove links 
dataset_messi.content = dataset_messi.content.apply(lambda x: re.sub(regex_links, " ", str(x).strip()))
# Remove special characters
dataset_messi.content = dataset_messi.content.apply(lambda x: re.sub(regex_special, " ", str(x).strip()))
# Remove ordinals
dataset_messi.content = dataset_messi.content.apply(lambda x: re.sub(regex_ordinals, " ", str(x).strip()))
# Remove numbers 
dataset_messi.content = dataset_messi.content.apply(lambda x: re.sub(regex_numbers, " ", str(x).strip()))
# Clean tweets
dataset_messi.content = dataset_messi.content.apply(lambda x: clean_tweet(x)) 
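
The order of these steps matters: mentions and links are recognised through characters such as `@`, `:` and `/`, so they have to be removed before the special characters, and ordinals have to be removed before plain numbers. The sketch below chains the regex steps in that order on a made-up tweet. `strip_noise` is a hypothetical helper; it leaves out the `clean_tweet` call, so it does not lowercase the text.

```python
import re

# Patterns copied from the notebook.
regex_mentions = r"@[A-Za-z0-9_]+"
regex_links = r"https?://[A-Za-z0-9./]+"
regex_special = r"[^A-Za-z0-9]+"
regex_numbers = r"[0-9]+"
regex_ordinals = r"[0-9]+(?:st| st|nd| nd|rd| rd|th| th)"

# Same order as in the cell above.
ORDERED_PATTERNS = [regex_mentions, regex_links, regex_special, regex_ordinals, regex_numbers]


def strip_noise(text):
    """Apply the regex steps in the notebook's order (hypothetical helper)."""
    text = str(text)
    for pattern in ORDERED_PATTERNS:
        text = re.sub(pattern, " ", text.strip())
    # Collapse the leftover runs of spaces; the notebook leaves this to clean().
    return " ".join(text.split())


result = strip_noise("@goal WOW, his 1st goal in 90 minutes! https://t.co/xyz")
print(result)
```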

#### Display results

In [16]:
dataset_messi.head(5)

|  | tweet_id | author_id | content | lang | date | source | geo | retweet_count | like_count | quote_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1516884143518461955 | 1407652046677987328 | this saturday lionel andr s messi will win the... | en | 2022-04-20T20:58:56.000Z | Twitter for Android | -1 | 0 | 0 | 0 |
| 1 | 1516884135561908224 | 1268536714177531905 | wow great for him just one more to reach messi | en | 2022-04-20T20:58:54.000Z | Twitter for Android | -1 | 0 | 0 | 0 |
| 2 | 1516884129614372864 | 1303984698461507585 | messi ain t the worst at pens stop the | en | 2022-04-20T20:58:53.000Z | Twitter for Android | -1 | 0 | 0 | 0 |
| 3 | 1516884128955777024 | 1245698960343302146 | messi was holding this team back | en | 2022-04-20T20:58:53.000Z | Twitter for Android | -1 | 0 | 0 | 0 |
| 4 | 1516884112811962368 | 1432532648472219650 | today benzema is better player in career not e... | en | 2022-04-20T20:58:49.000Z | Twitter Web App | -1 | 0 | 0 | 0 |
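
In the first row, "andrés" has become "andr s": the special-character pattern only keeps ASCII letters and digits, so the accented "é" is replaced by a space. If keeping such names intact matters, one option is to transliterate accented letters before applying the pattern. The sketch below uses the standard `unicodedata` module; `strip_accents` is a hypothetical helper that the original notebook does not contain.

```python
import re
import unicodedata

regex_special = r"[^A-Za-z0-9]+"  # pattern copied from the notebook


def strip_accents(text):
    """Turn accented letters into their ASCII base letter (hypothetical helper)."""
    # NFKD splits an accented letter into the base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping everything outside ASCII removes the combining marks
    # (and any other non-ASCII character, such as emojis).
    return decomposed.encode("ascii", "ignore").decode("ascii")


name = "lionel andr\u00e9s messi"  # "\u00e9" is the letter e with an acute accent

without_fix = re.sub(regex_special, " ", name)
with_fix = re.sub(regex_special, " ", strip_accents(name))
print(without_fix)
print(with_fix)
```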


### Check for non-English content

In [17]:
assert (dataset_messi["lang"] == "en").all(), "Non-English content detected"

> Test passed

## Ronaldo

### Check for null values

In [18]:
dataset_ronaldo.isna().sum()

tweet_id              0
author_id             2
content               2
lang                  5
date                  8
source                5
geo              158252
retweet_count         5
like_count            8
quote_count           8
dtype: int64

### Drop null values
Rows with a null value in any column relevant to our analysis are dropped; `geo` is handled separately below.

In [19]:
dataset_ronaldo.dropna(subset=["lang","date","source","retweet_count","like_count", "quote_count"], inplace=True)

In [20]:
print(f"Null values: \n{dataset_ronaldo.isna().sum()}")

Null values: 
tweet_id              0
author_id             0
content               0
lang                  0
date                  0
source                0
geo              158247
retweet_count         0
like_count            0
quote_count           0
dtype: int64
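
Because `geo` is left out of `subset`, rows that only lack a location are kept, which is why its null count barely changes (158252 to 158247). A minimal sketch of this behaviour on a made-up frame, with illustrative names `toy` and `kept`:

```python
import pandas as pd

# Made-up rows: one tweet lacks a language, another lacks only a location.
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "lang": ["en", None, "en"],
    "geo": ["0e587c59401d0a27", None, None],
})

# Only the columns listed in `subset` are checked, so the tweet with id 3
# survives even though its geo value is missing.
kept = toy.dropna(subset=["lang"])
print(kept)
```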


### Drop duplicates

In [21]:
print(f"Row count before deletion: {len(dataset_ronaldo.index)} ")
dataset_ronaldo = dataset_ronaldo.drop_duplicates(subset="tweet_id", keep="first")
print(f"Row count after deletion: {len(dataset_ronaldo.index)}")

Row count before deletion: 160318 
Row count after deletion: 160318


### Percentage of null geo values

In [22]:
count = len(dataset_ronaldo.index)
print(f"Number of values: {count}")

Number of values: 160318


In [23]:
count_null = dataset_ronaldo.isna().sum()["geo"]
percentage = count_null * 100/count
print(f"Percentage of null geo values: {round(percentage,2)}%")

Percentage of null geo values: 98.71%


### Assign the -1 id to null geo values

In [24]:
dataset_ronaldo.geo = dataset_ronaldo.geo.apply(lambda x: -1 if pd.isna(x) else x)

#### Display results

In [25]:
dataset_ronaldo["geo"].head(5)

0    -1
1    -1
2    -1
3    -1
4    -1
Name: geo, dtype: object

In [26]:
print(f"Some location ids: {dataset_ronaldo['geo'].unique()[0:5]}")

Some location ids: [-1 '0e587c59401d0a27' '67687709552688fe' '001907e868d06e24'
 '3dc7b71f520e2d15']


### Clean tweets: remove mentions, links, special characters, extra spaces...

In [27]:
# Remove mentions
dataset_ronaldo.content = dataset_ronaldo.content.apply(lambda x: re.sub(regex_mentions, " ", str(x).strip()))
# Remove links
dataset_ronaldo.content = dataset_ronaldo.content.apply(lambda x: re.sub(regex_links, " ", str(x).strip()))
# Remove special characters
dataset_ronaldo.content = dataset_ronaldo.content.apply(lambda x: re.sub(regex_special, " ", str(x).strip()))
# Remove ordinals
dataset_ronaldo.content = dataset_ronaldo.content.apply(lambda x: re.sub(regex_ordinals, " ", str(x).strip()))
# Remove numbers
dataset_ronaldo.content = dataset_ronaldo.content.apply(lambda x: re.sub(regex_numbers, " ", str(x).strip()))
# Clean tweets
dataset_ronaldo.content = dataset_ronaldo.content.apply(lambda x: clean_tweet(x)) 
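
Since both datasets go through the same regex steps, they could share one function. The sketch below is one possible way to do that with pandas' `Series.str.replace`. `strip_patterns` is a hypothetical helper, the sample tweets are made up, and the final `clean_tweet` step is left out, so the text is not lowercased.

```python
import pandas as pd

# Patterns copied from the notebook, listed in the order they are applied.
patterns = [
    r"@[A-Za-z0-9_]+",                         # mentions
    r"https?://[A-Za-z0-9./]+",                # links
    r"[^A-Za-z0-9]+",                          # special characters
    r"[0-9]+(?:st| st|nd| nd|rd| rd|th| th)",  # ordinals
    r"[0-9]+",                                 # numbers
]


def strip_patterns(content):
    """Apply every pattern to a Series of tweets (hypothetical helper)."""
    cleaned = content.astype(str)
    for pattern in patterns:
        cleaned = cleaned.str.strip().str.replace(pattern, " ", regex=True)
    # Collapse repeated spaces and trim the ends.
    return cleaned.str.replace(r"\s+", " ", regex=True).str.strip()


toy = pd.Series([
    "@talkSPORT Is Ronaldo back on Saturday?",
    "7 goals all season https://t.co/abc",
])
cleaned_toy = strip_patterns(toy).tolist()
print(cleaned_toy)
```

The same function could then be called once per dataset, for example on `dataset_messi.content` and `dataset_ronaldo.content`.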

#### Display results

In [28]:
dataset_ronaldo.head(5)

|  | tweet_id | author_id | content | lang | date | source | geo | retweet_count | like_count | quote_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1516885082958663680 | 311410036 | he s channeling his inner ronaldo | en | 2022-04-20T21:02:40.000Z | Twitter for iPhone | -1 | 0.0 | 0.0 | 0.0 |
| 1 | 1516885060074582018 | 968820590 | is ronaldo back on saturday | en | 2022-04-20T21:02:35.000Z | Twitter for iPhone | -1 | 0.0 | 0.0 | 0.0 |
| 2 | 1516885051337805824 | 3007018857 | goals all season ev fans class him like ronald... | en | 2022-04-20T21:02:33.000Z | Twitter for iPhone | -1 | 0.0 | 1.0 | 0.0 |
| 3 | 1516885050578837506 | 1516158593791565825 | ronaldo s baby boy | en | 2022-04-20T21:02:33.000Z | Twitter for Android | -1 | 0.0 | 0.0 | 0.0 |
| 4 | 1516885041250504704 | 4756096577 | y all said he will outscore ronaldo | en | 2022-04-20T21:02:30.000Z | Twitter for iPhone | -1 | 0.0 | 0.0 | 0.0 |


### Check for non-English content

In [29]:
assert (dataset_ronaldo["lang"] == "en").all(), "Non-English content detected"

```
Test passed
```

# Save the cleaned datasets 💾

> Note: `index=False` tells pandas not to write the DataFrame's row index as an extra column in the file. Each row already carries an identifier (`tweet_id`), so there is no need to write another one.

## Messi 

In [30]:
dataset_messi.to_csv("Cleaned_messi_tweets.csv", index=False)

## Ronaldo

In [31]:
dataset_ronaldo.to_csv("Cleaned_ronaldo_tweets.csv", index=False)
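
To see what `index=False` changes, the sketch below writes a made-up two-row frame to in-memory buffers with and without the index, then reads both back. The buffers stand in for the CSV files; all names are illustrative.

```python
import io
import pandas as pd

toy = pd.DataFrame({"tweet_id": [10, 20], "content": ["a", "b"]})

# Write once with the row index and once without, into in-memory buffers.
with_index = io.StringIO()
without_index = io.StringIO()
toy.to_csv(with_index)
toy.to_csv(without_index, index=False)

# Reading the first version back yields an extra, unnamed column.
with_index.seek(0)
without_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()
columns_without_index = pd.read_csv(without_index).columns.tolist()
print(columns_with_index)
print(columns_without_index)
```

The extra column is what pandas labels `Unnamed: 0` when the file is read again without `index_col`.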

# Thank you for your time 😄