## Q1: What proportion of tweets are actually in Pidgin English?
**`Goal:`** Determine how important it is to account for Pidgin English in the dataset

### 1. Import Packages

In [7]:
import pandas as pd

### 2. Import annotated dataset

In [8]:
lang_labelled = pd.read_csv('../../data/interim/lang_sample_labelled.csv')
lang_labelled.head()

Unnamed: 0,text,language
0,Please who is using https://t.co/LfA6GHacrA i ...,eng
1,@Adeorgomez @Clan_Clueless 4G... Now this is 3...,eng
2,@GucciJ9 @IgbokweKo @Eshenicy @DrJoeAbah @Spec...,eng
3,@Spectranet_NG what’s going on with your netwo...,eng
4,Shame on you @ntelng @ntelcare you guys promis...,eng


In [9]:
print(f"There are {len(lang_labelled)} tweets in the dataset")

There are 78 tweets in the dataset


### 3. Compute proportion of tweets that are in Pidgin English

In [10]:
lang_labelled.language.value_counts()

eng    66
pdg    12
Name: language, dtype: int64

In [11]:
lang_labelled.language.value_counts(normalize=True)

eng    0.846154
pdg    0.153846
Name: language, dtype: float64

Only about 15% (12 of 78) of the labelled tweets were in Pidgin English. Based on my labelling experience, most of these were also in light Pidgin English, i.e. the larger part of each sentence was still grammatically correct standard English. This is explored below:

### 4. Exploring tweets containing Pidgin English

In [18]:
for idx, pdg_tweet in enumerate(lang_labelled.query("language == 'pdg'")['text'], start=1):

    # Remove newline characters so each tweet prints on a single line
    pdg_tweet = pdg_tweet.replace('\n', "")

    # Print the numbered tweet followed by a blank line
    print(f"{idx})", pdg_tweet, '\n')

1) Let me just transfer money for my next subscription to my Spectranet purse before story will enter... 

2) @fimiletoks @mickey2ya @graffiti06 Tizeti is not scam o!They are the most gigantic scam. Dey show me fefe. 

3) @Spectranet_NG what's up with your speeds na? 

4) @eronmose1e @moyesparkle @whittyyumees @Spectranet_NG My brother all na scam but you see that spectranet ehn na sinzu them be, they Dey scam die! Internet speed self has been horrible 🤦🏽‍♂️ 

5) @bols_bols1 @Spectranet_NG You are special na 

6) @Tukooldegreat Baba spectranet na scam, the 100gb finishes in 1 week, not as if I use the data to watch porn 😔 

7) @aboyowa_e @Spectranet_NG Lmaoo! Na so, turn up!! 

8) @Spectranet_NG , see no make me swear for you! Fix your wacky internet connection around Yaba! 

9) MTNN @MTNNG  and spectranet if you guys are not going to dash us data atleast come correct on your services.We can't be wasting money in these glorious times. 

10) @rakspd You no see as I dey complain of @Spec

As shown above, many of these tweets are still written largely in grammatically correct English. Hence, accounting for Pidgin English separately might not be very important, given that a monolingual English model should still be able to predict sentiment reasonably accurately.
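The "light Pidgin" impression could be given a rough number by measuring what share of each tweet's words are Pidgin markers. The sketch below is only an illustration: the `PIDGIN_MARKERS` set and the `pidgin_token_share` helper are hypothetical, hand-picked for this example, and not a validated lexicon or part of the original analysis.

```python
import re

# Hypothetical, hand-picked Pidgin markers (an assumption, not a validated lexicon)
PIDGIN_MARKERS = {"na", "dey", "ehn", "wetin", "abi", "sef", "fefe", "o"}

def pidgin_token_share(text: str) -> float:
    """Return the fraction of word tokens that appear in PIDGIN_MARKERS."""
    # Drop @mentions so account handles are not counted as words
    text = re.sub(r"@\w+", " ", text)
    # Lowercase and keep alphabetic tokens (apostrophes kept for words like "what's")
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(token in PIDGIN_MARKERS for token in tokens)
    return hits / len(tokens)

# Two of the Pidgin-labelled tweets printed above
examples = [
    "@Spectranet_NG what's up with your speeds na?",
    "@bols_bols1 @Spectranet_NG You are special na",
]
for tweet in examples:
    print(f"{pidgin_token_share(tweet):.2f}  {tweet}")
```

A low share for most Pidgin-labelled tweets would be consistent with the observation that they are mostly standard English, although a marker list this small will miss Pidgin grammar that uses ordinary English words.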

**Note on the generalizability of the above analysis:** The labelled sample was quite small (78 tweets). However, the tweets were randomly selected to increase the likelihood that the sample is representative.
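To get a feel for how much uncertainty a sample of this size carries, one option is to put an approximate confidence interval around the observed proportion. The sketch below uses the Wilson score interval, which is a choice made for this example rather than part of the original analysis; the `wilson_interval` helper is hypothetical and it assumes the tweets are a simple random sample.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# Counts taken from the value_counts() output above: 12 Pidgin tweets out of 78
low, high = wilson_interval(12, 78)
print(f"Pidgin share: {12 / 78:.1%} (approx. 95% interval: {low:.1%} to {high:.1%})")
```

With only 78 tweets the interval is wide, roughly 9% to 25%, so the 15% figure is best read as an estimate rather than a precise share.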