# Data Cleaning on tweets about Messi and Ronaldo, by Ibrahim SEROUIS 💻

## What is data cleaning? 🧼

"Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset." [read more](https://www.tableau.com/learn/articles/what-is-data-cleaning)

## What to expect 🤔

In this notebook, we clean [these tweets](https://www.kaggle.com/datasets/ibrahimserouis99/twitter-sentiment-analysis-and-word-embeddings): we check for null values, drop duplicates, and strip mentions, links, special characters and numbers from the tweet text.

# Libraries

In [1]:
!pip install --user clean-text

Collecting clean-text
  Downloading clean_text-0.6.0-py3-none-any.whl (11 kB)
Collecting ftfy<7.0,>=6.0
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
Installing collected packages: ftfy, clean-text
Successfully installed clean-text-0.6.0 ftfy-6.1.1


In [2]:
import os
import re
import numpy as np
import pandas as pd
from cleantext import clean

# Files

## Dataset download

In [3]:
dataset_messi = pd.read_csv("/kaggle/input/twitter-sentiment-analysis-and-word-embeddings/messi_tweets.csv", encoding="utf-8")
dataset_ronaldo = pd.read_csv("/kaggle/input/twitter-sentiment-analysis-and-word-embeddings/ronaldo_tweets.csv", encoding="utf-8")



# Show some examples

## Messi

In [4]:
dataset_messi.head(5)

| | tweet_id | author_id | content | lang | date | source | geo | retweet_count | like_count | quote_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1515608437270454276 | 1464517603637014532 | @TvPrometheus @ViniBest20 @goatscristiano @Tot... | en | 2022-04-17T08:29:44.000Z | Twitter Web App | | 0 | 0 | 0 |
| 1 | 1515608409730699264 | 1307634596985659395 | @jjthxfortyping @_Hazpilicueta Many strikers d... | en | 2022-04-17T08:29:38.000Z | Twitter Web App | | 0 | 0 | 0 |
| 2 | 1515608399295270912 | 1475133084596981762 | @SamuelOyinloy13 @VDVMaestro @kingofnorthall @... | en | 2022-04-17T08:29:35.000Z | Twitter for Android | | 0 | 0 | 0 |
| 3 | 1515608375782002691 | 1390728932228485123 | @Mysticalleo_ Spot on, however, even fergie, s... | en | 2022-04-17T08:29:30.000Z | Twitter for iPhone | | 0 | 0 | 0 |
| 4 | 1515608312787750913 | 1342456471167119360 | @aymanfrmdao @Keith56962130 @LSPNFC_ Messi is ... | en | 2022-04-17T08:29:15.000Z | Twitter for Android | | 0 | 1 | 0 |


## Ronaldo

In [5]:
dataset_ronaldo.head(5)

| | tweet_id | author_id | content | lang | date | source | geo | retweet_count | like_count | quote_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1515589799943548929 | 1152201242766131202 | There's no reason to believe Cristiano Ronaldo... | en | 2022-04-17T07:15:41.000Z | Twitter for Android | | 0.0 | 0.0 | 0.0 |
| 1 | 1515589796130865155 | 1202283615427727360 | @TeamCRonaldo Vintage Ronaldo: Tottenham and N... | en | 2022-04-17T07:15:40.000Z | Twitter Web App | | 0.0 | 0.0 | 0.0 |
| 2 | 1515589765172768769 | 3368387687 | Werra mister champions league Ronaldo? Why did... | en | 2022-04-17T07:15:32.000Z | Twitter for iPhone | | 0.0 | 0.0 | 0.0 |
| 3 | 1515589753470861318 | 2615336101 | @kirmani_sadat @dayor_ray @OliverInce3 @Delerh... | en | 2022-04-17T07:15:30.000Z | Twitter for Android | | 0.0 | 0.0 | 0.0 |
| 4 | 1515589684013027335 | 1093900817046810625 | Ronaldo was greatly rewarded after hat-trick a... | en | 2022-04-17T07:15:13.000Z | IFTTT | | 0.0 | 0.0 | 0.0 |


# Data cleaning 🧼🧽

## Messi

### Check for null values

In [6]:
dataset_messi.isna().sum()

tweet_id              0
author_id             0
content               0
lang                  0
date                  0
source                0
geo              146972
retweet_count         0
like_count            0
quote_count           0
dtype: int64

### Drop duplicates
Based on tweet ID

In [7]:
print(f"Row count before deletion: {len(dataset_messi.index)} ")
dataset_messi = dataset_messi.drop_duplicates(subset="tweet_id", keep="first")
print(f"Row count after deletion: {len(dataset_messi.index)}")

Row count before deletion: 148103 
Row count after deletion: 111542
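
To see what `keep="first"` does on a small scale, here is a minimal sketch on a toy frame. The `toy` data and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: tweet_id 101 appears twice
toy = pd.DataFrame({
    "tweet_id": [101, 102, 101, 103],
    "content": ["first copy", "unique", "second copy", "unique too"],
})

# Keep only the first row seen for each tweet_id
deduplicated = toy.drop_duplicates(subset="tweet_id", keep="first")

# Number of rows that were dropped as duplicates
removed = len(toy.index) - len(deduplicated.index)
print(deduplicated)
print(f"Rows removed: {removed}")
```

Only the `tweet_id` column is compared, so the two rows with id 101 count as duplicates even though their `content` differs.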


### Percentage of null geo values

In [8]:
count = len(dataset_messi.index)
print(f"Number of values : {count}")

Number of values : 111542


In [9]:
count_null = dataset_messi.isna().sum()["geo"]
percentage = count_null*100/count
print(f"Percentage of null geo values {round(percentage,2)}%")

Percentage of null geo values 99.21%
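
The same percentage can be computed for every column at once, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch, assuming a made-up `toy` frame rather than the real dataset:

```python
import pandas as pd

# Hypothetical toy data: 3 of the 4 geo values are missing
toy = pd.DataFrame({
    "geo": [None, "02127c41c77b85ad", None, None],
    "lang": ["en", "en", "en", "en"],
})

# isna() gives a boolean mask; its column-wise mean is the share of nulls
null_percentages = (toy.isna().mean() * 100).round(2)
print(null_percentages)
```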


### Assign the -1 id to null geo values

In [10]:
dataset_messi.geo = dataset_messi.geo.apply(lambda x: -1 if pd.isna(x) else x)

#### Display results

In [11]:
dataset_messi["geo"].head(5)

0    -1
1    -1
2    -1
3    -1
4    -1
Name: geo, dtype: object

In [12]:
print(f"Available location ids: {dataset_messi['geo'].unique()[0:5]}")

Available location ids: [-1 '02127c41c77b85ad' '28679b23ed15b380' '0102baa70911fc52'
 '0179a94c2ab6d1e9']


### Clean tweets: helper function and regex patterns

#### Utility function

In [13]:
def clean_tweet(text):
    """
    Lowercases a text, removes punctuation and emojis, and normalizes whitespace.
    """
    
    text = clean(text,
                 no_punct=True,
                 lower=True,
                 no_emoji=True,
                 normalize_whitespace=True
                )
    
    return text

#### Regex

In [14]:
# Matches mentions (e.g. @username)
regex_mentions = r"@[A-Za-z0-9_]+"
# Matches links
regex_links = r"https?://[A-Za-z0-9./]+"
# Matches runs of characters that are neither letters nor digits
regex_special = r"[^A-Za-z0-9]+"
# Matches numbers
regex_numbers = r"[0-9]+"
# Matches ordinals, with or without a space before the suffix (e.g. 1st, 2 nd)
regex_ordinals = r"[0-9]+(?:st| st|nd| nd|rd| rd|th| th)"
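
To check what these patterns actually match, here is a small sketch using `re.findall` on a made-up tweet. The `sample` string and `regex_ordinals_strict` are assumptions added for illustration and are not part of the notebook. The strict variant adds a word boundary so that a number followed by a word starting with `th`, `st`, `nd` or `rd` is left alone.

```python
import re

# Patterns copied from the notebook
regex_mentions = r"@[A-Za-z0-9_]+"
regex_links = r"https?://[A-Za-z0-9./]+"
regex_ordinals = r"[0-9]+(?:st| st|nd| nd|rd| rd|th| th)"

# Hypothetical stricter variant: the suffix must end at a word boundary
regex_ordinals_strict = r"[0-9]+ ?(?:st|nd|rd|th)\b"

# Made-up tweet for illustration
sample = "@fan_01 Messi scored his 1st of 3 things https://t.co/abc123"

mentions = re.findall(regex_mentions, sample)
links = re.findall(regex_links, sample)
ordinals = re.findall(regex_ordinals, sample)
ordinals_strict = re.findall(regex_ordinals_strict, sample)

print(mentions)         # ['@fan_01']
print(links)            # ['https://t.co/abc123']
print(ordinals)         # ['1st', '3 th']
print(ordinals_strict)  # ['1st']
```

With the notebook's pattern, `3 things` loses its `3 th` prefix. Whether that matters depends on how often such phrases occur in the tweets.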

### Clean tweets: remove mentions, links, special characters, ordinals, numbers and extra spaces

In [15]:
# Remove mentions
dataset_messi.content = dataset_messi.content.apply(lambda x: re.sub(regex_mentions, " ", str(x).strip()))
# Remove links 
dataset_messi.content = dataset_messi.content.apply(lambda x: re.sub(regex_links, " ", str(x).strip()))
# Remove special characters
dataset_messi.content = dataset_messi.content.apply(lambda x: re.sub(regex_special, " ", str(x).strip()))
# Remove ordinals
dataset_messi.content = dataset_messi.content.apply(lambda x: re.sub(regex_ordinals, " ", str(x).strip()))
# Remove numbers 
dataset_messi.content = dataset_messi.content.apply(lambda x: re.sub(regex_numbers, " ", str(x).strip()))
# Clean tweets
dataset_messi.content = dataset_messi.content.apply(clean_tweet)
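
The order of the substitutions matters: `regex_special` removes `@`, `:` and `/`, so the mention and link patterns can only match if they run first. The sketch below illustrates this on a made-up tweet. `apply_in_order` and `sample` are hypothetical names introduced here, and the final `clean_tweet` step is left out.

```python
import re

# Patterns copied from the notebook
regex_mentions = r"@[A-Za-z0-9_]+"
regex_links = r"https?://[A-Za-z0-9./]+"
regex_special = r"[^A-Za-z0-9]+"


def apply_in_order(text, patterns):
    """Replace each pattern with a space, in the given order (hypothetical helper)."""
    for pattern in patterns:
        text = re.sub(pattern, " ", text.strip())
    # Collapse the extra spaces left behind by the substitutions
    return " ".join(text.split())


# Made-up tweet for illustration
sample = "@fan_01 Messi is the GOAT! https://t.co/abc123"

notebook_order = apply_in_order(sample, [regex_mentions, regex_links, regex_special])
special_first = apply_in_order(sample, [regex_special, regex_mentions, regex_links])

print(notebook_order)  # Messi is the GOAT
print(special_first)   # fan 01 Messi is the GOAT https t co abc123
```

Running `regex_special` first leaves the username and the link fragments in the text as ordinary words.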

#### Display results

In [16]:
dataset_messi.head(5)

| | tweet_id | author_id | content | lang | date | source | geo | retweet_count | like_count | quote_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1515608437270454276 | 1464517603637014532 | messi missed a pen against real madrid in the ... | en | 2022-04-17T08:29:44.000Z | Twitter Web App | -1 | 0 | 0 | 0 |
| 1 | 1515608409730699264 | 1307634596985659395 | many strikers does that also not to mention th... | en | 2022-04-17T08:29:38.000Z | Twitter Web App | -1 | 0 | 0 | 0 |
| 2 | 1515608399295270912 | 1475133084596981762 | lionel messi has goals in finals check anywhere | en | 2022-04-17T08:29:35.000Z | Twitter for Android | -1 | 0 | 0 | 0 |
| 3 | 1515608375782002691 | 1390728932228485123 | spot on however even fergie scholes beckham al... | en | 2022-04-17T08:29:30.000Z | Twitter for iPhone | -1 | 0 | 0 | 0 |
| 4 | 1515608312787750913 | 1342456471167119360 | messi is not a striker he s a playmaker giving... | en | 2022-04-17T08:29:15.000Z | Twitter for Android | -1 | 0 | 1 | 0 |


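### Check for non-English content

In [17]:
# Compare row by row: comparing unique() to a string fails with a ValueError
# instead of an AssertionError when more than one language is present
assert (dataset_messi["lang"] == "en").all(), "Non-English content detected"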

> Test passed

## Ronaldo

### Check for null values

In [18]:
dataset_ronaldo.isna().sum()

tweet_id              0
author_id             0
content               0
lang                  2
date                  4
source                2
geo              146217
retweet_count         2
like_count            4
quote_count           4
dtype: int64

### Drop null values
For columns relevant to our analysis

In [19]:
dataset_ronaldo.dropna(subset=["lang","date","source","retweet_count","like_count", "quote_count"], inplace=True)

In [20]:
dataset_ronaldo.isna().sum()

tweet_id              0
author_id             0
content               0
lang                  0
date                  0
source                0
geo              146215
retweet_count         0
like_count            0
quote_count           0
dtype: int64
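
Because `subset` restricts the null check to the listed columns, rows whose only missing value is `geo` are kept, which is why `geo` still shows nulls in the output above. A minimal sketch on a made-up `toy` frame (not the real dataset):

```python
import pandas as pd

# Hypothetical toy data: row 1 has no lang, rows 0 and 1 have no geo
toy = pd.DataFrame({
    "lang": ["en", None, "en"],
    "geo": [None, None, "0d282a8da695a000"],
})

# Drop rows only when "lang" is null; nulls in "geo" are ignored
kept = toy.dropna(subset=["lang"])

print(kept)
print(kept.isna().sum())
```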

### Drop duplicates

In [21]:
print(f"Row count before deletion: {len(dataset_ronaldo.index)} ")
dataset_ronaldo = dataset_ronaldo.drop_duplicates(subset="tweet_id", keep="first")
print(f"Row count after deletion: {len(dataset_ronaldo.index)}")

Row count before deletion: 148134 
Row count after deletion: 148134


### Percentage of null geo values

In [22]:
count = len(dataset_ronaldo.index)
print(f"Number of values: {count}")

Number of values: 148134


In [23]:
count_null = dataset_ronaldo.isna().sum()["geo"]
percentage = count_null * 100/count
print(f"Percentage of null geo values: {round(percentage,2)}%")

Percentage of null geo values: 98.7%


### Assign the -1 id to null geo values

In [24]:
dataset_ronaldo.geo = dataset_ronaldo.geo.apply(lambda x: -1 if pd.isna(x) else x)

#### Display results

In [25]:
dataset_ronaldo["geo"].head(5)

0    -1
1    -1
2    -1
3    -1
4    -1
Name: geo, dtype: object

In [26]:
print(f"Available location ids: {dataset_ronaldo['geo'].unique()[0:5]}")

Available location ids: [-1 '0d282a8da695a000' '511655fc081bb251' 'e4a0d228eb6be76b'
 '366946a7b72f271b']


### Clean tweets: remove mentions, links, special characters, ordinals, numbers and extra spaces

In [27]:
# Remove mentions
dataset_ronaldo.content = dataset_ronaldo.content.apply(lambda x: re.sub(regex_mentions, " ", str(x).strip()))
# Remove links
dataset_ronaldo.content = dataset_ronaldo.content.apply(lambda x: re.sub(regex_links, " ", str(x).strip()))
# Remove special characters
dataset_ronaldo.content = dataset_ronaldo.content.apply(lambda x: re.sub(regex_special, " ", str(x).strip()))
# Remove ordinals
dataset_ronaldo.content = dataset_ronaldo.content.apply(lambda x: re.sub(regex_ordinals, " ", str(x).strip()))
# Remove numbers
dataset_ronaldo.content = dataset_ronaldo.content.apply(lambda x: re.sub(regex_numbers, " ", str(x).strip()))
# Clean tweets
dataset_ronaldo.content = dataset_ronaldo.content.apply(clean_tweet)

#### Display results

In [28]:
dataset_ronaldo.head(5)

| | tweet_id | author_id | content | lang | date | source | geo | retweet_count | like_count | quote_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1515589799943548929 | 1152201242766131202 | there s no reason to believe cristiano ronaldo... | en | 2022-04-17T07:15:41.000Z | Twitter for Android | -1 | 0.0 | 0.0 | 0.0 |
| 1 | 1515589796130865155 | 1202283615427727360 | vintage ronaldo tottenham and norwich tap in h... | en | 2022-04-17T07:15:40.000Z | Twitter Web App | -1 | 0.0 | 0.0 | 0.0 |
| 2 | 1515589765172768769 | 3368387687 | werra mister champions league ronaldo why did ... | en | 2022-04-17T07:15:32.000Z | Twitter for iPhone | -1 | 0.0 | 0.0 | 0.0 |
| 3 | 1515589753470861318 | 2615336101 | yeah but facts don t lie messi is missing awar... | en | 2022-04-17T07:15:30.000Z | Twitter for Android | -1 | 0.0 | 0.0 | 0.0 |
| 4 | 1515589684013027335 | 1093900817046810625 | ronaldo was greatly rewarded after hat trick a... | en | 2022-04-17T07:15:13.000Z | IFTTT | -1 | 0.0 | 0.0 | 0.0 |


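### Check for non-English content

In [29]:
# Compare row by row: comparing unique() to a string fails with a ValueError
# instead of an AssertionError when more than one language is present
assert (dataset_ronaldo["lang"] == "en").all(), "Non-English content detected"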

> Test passed

# Save the cleaned datasets 💾

## Messi 

In [30]:
dataset_messi.to_csv("Cleaned_messi_tweets.csv", index=False)

## Ronaldo

In [31]:
dataset_ronaldo.to_csv("Cleaned_ronaldo_tweets.csv", index=False)
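
One thing to keep in mind when reloading these files: the `geo` column mixes the integer `-1` with string ids, and a CSV does not store types. The sketch below, on a made-up `toy` frame written to an in-memory buffer instead of a file, suggests that the sentinel comes back as the string `"-1"`.

```python
import io

import pandas as pd

# Hypothetical toy data mirroring the mixed geo column (-1 sentinel plus string ids)
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "geo": [-1, "02127c41c77b85ad", -1],
})

# Round-trip through CSV using an in-memory buffer
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, every geo value is text, including the sentinel
geo_values = reloaded["geo"].tolist()
sentinel_count = int((reloaded["geo"] == "-1").sum())
print(geo_values)
print(f"Rows with the '-1' sentinel: {sentinel_count}")
```

Downstream code that filters on the sentinel would therefore need to compare against `"-1"`, or convert the column explicitly after loading.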