# Social Network Analysis - Amber Heard Case - Facebook
MAHMOUD NAGY - FEB 2022

## Table of Contents
<ul>
<li><a href="#intro"><b>Introduction</b></a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#end">End of Notebook</a></li>  
</ul>

<a id='intro'></a>
## Introduction

> Facebook - Social Network Analysis on Amber Heard's Case Example.<br>
Threat analysis of the comments and reviews on her Facebook page.

In [13]:
import pandas as pd
import numpy as np
import os
import time

import helpers

# Apply updates to helpers without restarting the kernel
import importlib
importlib.reload(helpers)

<module 'helpers' from '/Users/mnagy99/jupyter/AH/Facebook/SNA-AH-Case-Facebook/Wrangling & Analysis/helpers.py'>

In [14]:
cd ../data

/Users/mnagy99/jupyter/AH/Facebook/SNA-AH-Case-Facebook/data


In [3]:
csv_files = []
for file in sorted(os.listdir()):
    if file.split(".")[-1] == "csv":
        csv_files.append(file)

In [4]:
csv_files

['From_Facebook_post_0.csv',
 'From_Facebook_post_1.csv',
 'From_Facebook_post_2.csv',
 'From_Facebook_post_3.csv',
 'From_Facebook_post_4.csv',
 'From_Facebook_post_5.csv',
 'From_Facebook_profile_review.csv',
 'From_Facebook_searchurl_0.csv',
 '_all_data - NLU labeled train 12.4K.csv']

In [5]:
len(csv_files)

9
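The same listing can be written with `pathlib`, which compares the file suffix instead of splitting the name by hand and also skips directories. This is a minimal sketch; the helper name `list_csv_files` is an assumption and not part of the notebook.

```python
from pathlib import Path


def list_csv_files(directory="."):
    """Return the sorted names of the CSV files directly inside `directory`."""
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix == ".csv"
    )
```

Calling `list_csv_files()` from the data folder should give the same ordering as the loop above, since both sort by file name.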

<br>

<a id='wrangling'></a>
# Data Wrangling

<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling"><b>Data Wrangling</b></a></li>
<li><a href="#end">End of Notebook</a></li>  
</ul>

<a id='comments'></a>
>># Posts Comments
> <ul>
 <li><a href="#comments"><b>Posts Comments</b></a></li>  
 <li><a href="#reviews">Profile Reviews</a></li>  
 <li><a href="#nlu">NLU DATA</a></li>
 <li><a href="#merged">Merged DATA</a></li>
 </ul>

In [6]:
df_post0 = pd.read_csv(csv_files[0])
df_post1 = pd.read_csv(csv_files[1])
df_post2 = pd.read_csv(csv_files[2])
df_post3 = pd.read_csv(csv_files[3])
df_post4 = pd.read_csv(csv_files[4])
df_post5 = pd.read_csv(csv_files[5])
df_profile_review = pd.read_csv(csv_files[6])
df_nlu = pd.read_csv(csv_files[8])

In [7]:
dfs = [
    df_post0,
    df_post1,
    df_post2,
    df_post3,
    df_post4,
    df_post5,
]

### Create a column to classify the text type  (Comment / Recommend / Not Recommend)
Preparing for further merge

In [8]:
df_post0['type'] = 'post0_comment'
df_post1['type'] = 'post1_comment'
df_post2['type'] = 'post2_comment'
df_post3['type'] = 'post3_comment'
df_post4['type'] = 'post4_comment'
df_post5['type'] = 'post5_comment'

# Comments

In [9]:
df_comments = pd.concat(dfs)
print(df_comments.shape)
df_comments.head(2)

(4665, 3)


Unnamed: 0,userName,comment,type
0,Raychel RayRay,Skank,post0_comment
1,Rich Lawson,Not only are you a pretty rubbish actress you ...,post0_comment
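The per-post tagging and concatenation above repeat the same statement six times. A loop keeps the label in sync with the position of each frame. This is a sketch only; `tag_and_concat` is a hypothetical helper, and it uses `assign` so the input frames are left unmodified.

```python
import pandas as pd


def tag_and_concat(frames):
    """Concatenate post DataFrames, labelling each row with its source post.

    `frames` is a list of DataFrames in post order (post 0 first).
    """
    tagged = [
        frame.assign(type=f"post{i}_comment")
        for i, frame in enumerate(frames)
    ]
    return pd.concat(tagged)
```

With the notebook's list, `tag_and_concat(dfs)` would be expected to produce the same three columns as `df_comments`.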



<br>

### Rename Columns

In [10]:
df_comments.rename(columns={'userName': 'username'}, inplace=True)

<br>

In [11]:
print(df_comments.shape)
df_comments.head(2)

(4665, 3)


Unnamed: 0,username,comment,type
0,Raychel RayRay,Skank,post0_comment
1,Rich Lawson,Not only are you a pretty rubbish actress you ...,post0_comment


<br>

### The Number of Repeated Comments on Different Posts by the Same User

In [12]:
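# Note: duplicated() compares all columns, including `type`, so the same comment
# posted by the same user under a different post is not counted as a repeat here.
print(f'\nThe Number of Repeated Comments on Different Posts by the Same User: {df_comments.duplicated().sum()}\n')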



The Number of Repeated Comments on Different Posts by the Same User: 0
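Because every row carries a post-specific `type`, a full-row comparison cannot flag a comment that one user repeated under several posts. Restricting the comparison to the user and the text would measure that. The sketch below assumes the renamed columns `username` and `comment`; the function name is hypothetical.

```python
import pandas as pd


def count_cross_post_repeats(df):
    """Count rows whose (username, comment) pair already appeared in an earlier row."""
    return int(df.duplicated(subset=["username", "comment"]).sum())
```

Note that missing comments compare as equal here, so a user with two deleted comments would also be counted.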



In [13]:
df_comments.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4665 entries, 0 to 362
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   username  4665 non-null   object
 1   comment   4659 non-null   object
 2   type      4665 non-null   object
dtypes: object(3)
memory usage: 145.8+ KB


# Deleted Comments

In [14]:
df_deleted = df_comments[df_comments.comment.isnull()]
df_deleted

Unnamed: 0,username,comment,type
260,Abrar Qader,,post0_comment
15,Dayse Alves,,post1_comment
41,Manfried von Thun,,post2_comment
100,Jodi Line,,post3_comment
7,Arthur Goebbels,,post4_comment
3,Liadan Valeska,,post5_comment


In [15]:
deleted_list = list(df_deleted.username)

In [16]:
df_comments[df_comments.username.isin(deleted_list)]

Unnamed: 0,username,comment,type
260,Abrar Qader,,post0_comment
15,Dayse Alves,,post1_comment
41,Manfried von Thun,,post2_comment
100,Jodi Line,,post3_comment
7,Arthur Goebbels,,post4_comment
3,Liadan Valeska,,post5_comment


<br>

<a id='reviews'></a>
>># Profile Reviews
> <ul>
 <li><a href="#comments">Posts Comments</a></li>  
 <li><a href="#reviews"><b>Profile Reviews</b></a></li>  
 <li><a href="#nlu">NLU DATA</a></li>
 <li><a href="#merged">Merged DATA</a></li>
 </ul>

# Profile Review Data

In [17]:
df_profile_review = pd.read_csv(csv_files[6])
print(df_profile_review.shape)
df_profile_review.head(2)

(657, 3)


Unnamed: 0,userName,recommend,text
0,Gianluca Carlesso,Gianluca Carlesso doesn't recommend Amber Heard.,"horrible person, that's the best description"
1,Danieru Tan,Danieru Tan doesn't recommend Amber Heard.,Poor acting. Bad character. Not recommended fo...


In [18]:
df_profile_review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 657 entries, 0 to 656
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userName   657 non-null    object
 1   recommend  657 non-null    object
 2   text       657 non-null    object
dtypes: object(3)
memory usage: 15.5+ KB


<br>

### Create a column to classify the text type  (Comment / Recommend / Not Recommend)

In [19]:
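# Map the review sentence ("X recommends ..." / "X doesn't recommend ...") to a label
conditions = [df_profile_review['recommend'].str.contains("doesn't recommend"),
              df_profile_review['recommend'].str.contains('recommends')]

choices = ['Not Recommend', 'Recommend']

# Rows matching neither condition fall back to the placeholder '...'
df_profile_review["type"] = np.select(conditions, choices, default='...')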

In [20]:
df_profile_review.type.value_counts()  

Not Recommend    615
Recommend         42
Name: type, dtype: int64

<br>

### Drop Extraneous columns

In [21]:
df_profile_review.drop(columns='recommend', inplace=True)

<br>

### Rename Columns

In [22]:
df_profile_review.rename(columns={'text': 'comment', 'userName': 'username'}, inplace=True)

<br>

In [23]:
print(df_profile_review.shape)
df_profile_review.head(2)

(657, 3)


Unnamed: 0,username,comment,type
0,Gianluca Carlesso,"horrible person, that's the best description",Not Recommend
1,Danieru Tan,Poor acting. Bad character. Not recommended fo...,Not Recommend


<br>

<a id='nlu'></a>
>># NLU DATA
> <ul>
 <li><a href="#comments">Posts Comments</a></li>  
 <li><a href="#reviews">Profile Reviews</a></li>  
 <li><a href="#nlu"><b>NLU DATA</b></a></li>
 <li><a href="#merged">Merged DATA</a></li>
 </ul>

# NLU

In [24]:
df_nlu = pd.read_csv(csv_files[8])
df_nlu.drop(columns='id', inplace=True)
print(df_nlu.shape)
df_nlu.head(2)

(12487, 5)


Unnamed: 0,comment_text,defense_AH,support_AH,offense_AH,defense_against_AH
0,"We love you Amber! Karen Ingala Smith, CEO of ...",0.0,1.0,0.0,0.0
1,She is a hero to domestic violence victims. Sh...,0.0,1.0,0.0,0.0


### Change Data Types

In [25]:
# Convert the 0.0/1.0 label columns to boolean
df_nlu['defense_AH'] = df_nlu['defense_AH'].astype('bool')
df_nlu['support_AH'] = df_nlu['support_AH'].astype('bool')
df_nlu['offense_AH'] = df_nlu['offense_AH'].astype('bool')
df_nlu['defense_against_AH'] = df_nlu['defense_against_AH'].astype('bool')
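One caveat with `astype('bool')` on float columns is that a missing value (`NaN`) is converted to `True`. The labelled file has no missing labels, so the cast above is safe, but a guard makes that assumption explicit. This is a sketch; `LABEL_COLUMNS` and `labels_to_bool` are hypothetical names.

```python
import pandas as pd

LABEL_COLUMNS = ["defense_AH", "support_AH", "offense_AH", "defense_against_AH"]


def labels_to_bool(df, columns=LABEL_COLUMNS):
    """Return a copy of `df` with the label columns cast to bool.

    Raises ValueError if a label is missing, because NaN would silently become True.
    """
    missing = int(df[columns].isna().sum().sum())
    if missing:
        raise ValueError(f"{missing} missing label value(s); fill or drop them first")
    return df.astype({column: bool for column in columns})
```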

### Rename Columns

In [26]:
df_nlu.rename(columns={'comment_text': 'comment'}, inplace=True)

<br>

In [27]:
print(df_nlu.shape)
df_nlu.head(2)

(12487, 5)


Unnamed: 0,comment,defense_AH,support_AH,offense_AH,defense_against_AH
0,"We love you Amber! Karen Ingala Smith, CEO of ...",False,True,False,False
1,She is a hero to domestic violence victims. Sh...,False,True,False,False


In [28]:
df_nlu.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12487 entries, 0 to 12486
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   comment             12487 non-null  object
 1   defense_AH          12487 non-null  bool  
 2   support_AH          12487 non-null  bool  
 3   offense_AH          12487 non-null  bool  
 4   defense_against_AH  12487 non-null  bool  
dtypes: bool(4), object(1)
memory usage: 146.5+ KB


<br>

<a id='merged'></a>
>># Merged DATA
> <ul>
 <li><a href="#comments">Posts Comments</a></li>  
 <li><a href="#reviews">Profile Reviews</a></li>  
 <li><a href="#nlu">NLU DATA</a></li>
 <li><a href="#merged"><b>Merged DATA</b></a></li>
 </ul>

In [29]:
# First Merge the Comments and Reviews data
df_merged = pd.concat([df_comments, df_profile_review])
print(df_merged.shape)
df_merged.head()

(5322, 3)


Unnamed: 0,username,comment,type
0,Raychel RayRay,Skank,post0_comment
1,Rich Lawson,Not only are you a pretty rubbish actress you ...,post0_comment
2,Zeyneb Yusein,"You're a shame and disgrace for your family, I...",post0_comment
3,William Mcneillie,😂 what's next in your game plan? Blame Arthur ...,post0_comment
4,Flash Garot,"No one cares, you suck",post0_comment


In [30]:
# Then Add the NLU to the merged data
df_merged_final = df_merged.merge(df_nlu, on='comment', how='left')
print(df_merged_final.shape)
df_merged_final.head()

(5322, 7)


Unnamed: 0,username,comment,type,defense_AH,support_AH,offense_AH,defense_against_AH
0,Raychel RayRay,Skank,post0_comment,,,,
1,Rich Lawson,Not only are you a pretty rubbish actress you ...,post0_comment,,,,
2,Zeyneb Yusein,"You're a shame and disgrace for your family, I...",post0_comment,,,,
3,William Mcneillie,😂 what's next in your game plan? Blame Arthur ...,post0_comment,,,,
4,Flash Garot,"No one cares, you suck",post0_comment,,,,
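The first rows come back without labels, so it helps to know how many comments found a match in the labelled data. The join is an exact match on the raw comment text, so any difference in case or whitespace prevents a match. A sketch using the `indicator` option of `merge` follows; the helper name is an assumption.

```python
import pandas as pd


def merge_with_match_report(left, right, key="comment"):
    """Left-join `right` onto `left` and report how many left rows found a match."""
    merged = left.merge(right, on=key, how="left", indicator=True)
    matched = int((merged["_merge"] == "both").sum())
    return merged.drop(columns="_merge"), matched
```

For example, `merge_with_match_report(df_merged, df_nlu)` would return the merged frame together with the number of labelled rows.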


# Normalizing Text

In [31]:
df_merged_final.comment = df_merged_final.comment.str.lower()

In [32]:
df_offense = df_merged_final[df_merged_final.offense_AH == True]
print(df_offense.shape)
df_offense.head()

(1467, 7)


Unnamed: 0,username,comment,type,defense_AH,support_AH,offense_AH,defense_against_AH
158,Melvs Grao,ive watch your aqua man movie and in amaze...i...,post0_comment,False,True,True,False
217,Kaly Quesada,you are a horrible human. i hope you never get...,post0_comment,False,False,True,False
238,Juju Foulk,if 2020 were a person it’d be amber heard.,post0_comment,False,False,True,False
493,Haily Abram,sign and share to get biased justice nicol fir...,post0_comment,False,False,True,False
569,Devon Acosta,actual garage >>> amber heard,post0_comment,False,False,True,False


# Add A contains_alpha Column

In [112]:
df_merged_final['comment'] = df_merged_final['comment'].fillna('isnan')

In [113]:
df_merged_final['comment'].isnull().sum()

0

In [130]:
# def replace_nan(x):
#     if x == 'nan':
#         x = x.replace('nan', 'isnan')
#     return x

In [127]:
# df_merged_final['comment'] = df_merged_final['comment'].apply(lambda x: replace_nan(x))

In [129]:
# df_merged_final[df_merged_final.comment == 'isnan']

In [115]:
df_merged_final[df_merged_final.comment.isnull()]

Unnamed: 0,username,comment,type,defense_AH,support_AH,offense_AH,defense_against_AH,contains_alpha,clean_text,language,translated


In [34]:
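import re
# True when the first line of the comment contains an ASCII letter. '.' does not match '\n',
# so letters that appear only after a line break are not detected by this pattern.
df_merged_final['contains_alpha'] = df_merged_final.comment.apply(lambda x: bool(re.match('^(?=.*[a-zA-Z])', x)))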

In [35]:
df_merged_final.contains_alpha.value_counts()

True     5175
False     147
Name: contains_alpha, dtype: int64
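The lookahead pattern is anchored at the start of the string and `.` stops at a newline, so a comment whose letters appear only after a line break is reported as containing no letters. Searching the whole string avoids that, and `str.isalpha` additionally recognises non-Latin and styled letters. Both helpers below are hypothetical alternatives, not the notebook's method, and would change the counts shown above.

```python
import re

ASCII_LETTER = re.compile(r"[a-zA-Z]")


def contains_ascii_letter(text):
    """True if an ASCII letter appears anywhere in `text`, including after line breaks."""
    return ASCII_LETTER.search(text) is not None


def contains_any_letter(text):
    """True if any Unicode letter appears, e.g. Cyrillic, kana or styled Latin letters."""
    return any(char.isalpha() for char in text)
```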

In [36]:
# df_merged_final.contains_alpha.sum()

In [37]:
# print(df_merged_final[~df_merged_final.contains_alpha].shape)
# df_merged_final[~df_merged_final.contains_alpha]

<br>

Sinhala is an Indo-Aryan language primarily spoken by the Sinhalese people of Sri Lanka, who make up the largest ethnic group on the island, numbering about 16 million. Sinhala is also spoken as the first language by other ethnic groups in Sri Lanka, totaling about 4 million people as of 2001. (Source: Wikipedia)

- muerte --> (Spanish) Death
- 死ね --> (Japanese) Die
- muerete --> (Spanish) Die
- morire --> (Italian) Die
- morir --> (Spanish) To die
- muere --> (Spanish) Dies / Die
- matar --> (Portuguese) to kill

地獄で死ね --> Die in hell <br>
死ね --> Die <br>
死 --> Death <br>

- くたばれ この野郎。地獄で死ね！！！！<br>
(Japanese) Drop dead, you bastard. Die in hell!!!!
- ගොන් හුත්ති... තොට කෙලවෙලාම පල වේස බැල්ලි... <br>
(Sinhala) Bullshit ... bitch at the end of the mouth ...
- මේ පට්ටෞ ත්.ති.ගෙ දුව මැරිලාවත් යන් නෑනෙ <br>
(Sinhala) The daughter of this Pattau Ththi is not even dead
- පින්න බැල්ලි ද්ද්න් සතුටුද <br>
(Sinhala) Are the deer bitches happy?

In [38]:
df_merged_final[~df_merged_final.contains_alpha].comment.value_counts().head(60)

💩                                                                                                4
🤮🤮🤮🤮🤮                                                                                            4
🐍 🐍 🐍                                                                                            3
𝗗𝗶𝘀𝗹𝗶𝗸𝗲/𝗨𝗻𝗳𝗼𝗹𝗹𝗼𝘄 𝗔𝗺𝗯𝗲𝗿 𝗛𝗲𝗮𝗿𝗱 𝗼𝗻 𝗮𝗹𝗹 𝘀𝗼𝗰𝗶𝗮𝗹 𝗺𝗲𝗱𝗶𝗮. 𝗟𝗶𝗸𝗲 𝘁𝗵𝗶𝘀 𝗰𝗼𝗺𝗺𝗲𝗻𝘁 𝗮𝗻𝗱 𝘁𝗮𝗸𝗲 𝗶𝘁 𝘁𝗼 𝘁𝗵𝗲 𝘁𝗼𝗽.      2
.                                                                                                2
✨💓💪🏻🥰✨\r\n#deppcember\r\n#justiceforjohnnydeep\r\n…see more                                      2
🤮🤮🤮🤮🤮🤮🤮                                                                                          2
💩💩💩💩💩💩💩💩                                                                                         2
𝗗𝗶𝘀𝗹𝗶𝗸𝗲/𝗨𝗻𝗳𝗼𝗹𝗹𝗼𝘄 𝗔𝗺𝗯𝗲𝗿 𝗛𝗲𝗮𝗿𝗱 𝗳𝗿𝗼𝗺 𝗮𝗹𝗹 𝘀𝗼𝗰𝗶𝗮𝗹 𝗺𝗲𝗱𝗶𝗮. 𝗟𝗶𝗸𝗲 𝘁𝗵𝗶𝘀 𝗰𝗼𝗺𝗺𝗲𝗻𝘁 𝗮𝗻𝗱 𝘁𝗮𝗸𝗲 𝗶𝘁 𝘁𝗼 𝘁𝗵𝗲 𝘁𝗼𝗽.    2
✨ 😍💓🥰💪🏻😘✨\r\n#justiceforjohnnydeep\r\n#johnnydeppisinnocent\r\n…see more                         2
💩💩💩💩💩💩💩💩💩 

In [39]:
df_merged_final.loc[df_merged_final['comment'] == '💚 💚 💚', 'contains_alpha']

4602    False
Name: contains_alpha, dtype: bool

In [40]:
df_merged_final.loc[df_merged_final['comment'] == '💚 💚 💚', 'contains_alpha'].values[0]

False

In [41]:
df_merged_final.loc[df_merged_final['comment'] == '💚 💚 💚', 'contains_alpha'].tolist()

[False]

In [42]:
df_merged_final.loc[df_merged_final['comment'] == '💚 💚 💚', 'contains_alpha'].item()

False

# Remove Emojis

In [43]:
def remove_emojis(text):
    emoji_pattern = re.compile("["
                               "❤️"
                               "🥰"
                               "🤮"
                               "🤢"
                               "🤬"
                               "🤏"
                               "🤡"
                               "🥴"
                               "✨"
                               "🤜"
                               "🤛"
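            u"\U0001F600-\U0001F64F"  # emoticons
            u"\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
            u"\U0001F680-\U0001F6FF"  # transport and map symbols
            u"\U0001F1E0-\U0001F1FF"  # regional indicators (flags)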
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
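The individually listed emojis are needed because the four ranges do not cover newer pictographs. A range-only variant is sketched below. The chosen set of Unicode blocks is an assumption; it is wider than the notebook's pattern and therefore removes more symbols (for example all dingbats).

```python
import re

# Assumed set of ranges; wider than the notebook's pattern, so it removes more symbols.
EMOJI_RANGES = re.compile(
    "["
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that follows some emojis
    "]+"
)


def remove_emojis_by_range(text):
    """Delete every character that falls in one of the ranges above."""
    return EMOJI_RANGES.sub("", text)
```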

In [44]:
# normalization: strip emojis from the lowercased comments
df_merged_final['clean_text'] = df_merged_final.comment.apply(remove_emojis)

In [45]:
df_merged_final.clean_text = df_merged_final.clean_text.apply(
    lambda x: x.replace('"', '').replace('\\', '').replace('/', '').replace('.', '').replace('-', '').replace('…', '').replace('(', '').replace(')', '').replace('¯', '').replace('\t', ' ').replace('\n', ' ').replace('\r', '').replace('       ',' ').replace('      ',' ').replace('     ',' ').replace('   ',' ').replace('  ',' '))
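The chain of `replace` calls can be expressed as one translation table plus one whitespace rule. This sketch is not byte-for-byte equivalent: it collapses whitespace runs of any length (the chain handles specific lengths only). The names `DELETE_CHARS` and `strip_punctuation` are hypothetical.

```python
import re

# Characters the notebook deletes outright.
DELETE_CHARS = str.maketrans("", "", '"\\/.-…()¯\r')
WHITESPACE_RUN = re.compile(r"\s+")


def strip_punctuation(text):
    """Delete the listed characters, then collapse every whitespace run to one space."""
    return WHITESPACE_RUN.sub(" ", text.translate(DELETE_CHARS))
```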

In [46]:
df_merged_final[~df_merged_final.contains_alpha].clean_text.value_counts().head(60)

                                                                                                                                         90
                                                                                                                                         11
 #deppcember #justiceforjohnnydeep see more                                                                                               2
𝗗𝗶𝘀𝗹𝗶𝗸𝗲𝗨𝗻𝗳𝗼𝗹𝗹𝗼𝘄 𝗔𝗺𝗯𝗲𝗿 𝗛𝗲𝗮𝗿𝗱 𝗼𝗻 𝗮𝗹𝗹 𝘀𝗼𝗰𝗶𝗮𝗹 𝗺𝗲𝗱𝗶𝗮 𝗟𝗶𝗸𝗲 𝘁𝗵𝗶𝘀 𝗰𝗼𝗺𝗺𝗲𝗻𝘁 𝗮𝗻𝗱 𝘁𝗮𝗸𝗲 𝗶𝘁 𝘁𝗼 𝘁𝗵𝗲 𝘁𝗼𝗽                                                  2
 #justiceforjohnnydeep #johnnydeppisinnocent see more                                                                                     2
𝗗𝗶𝘀𝗹𝗶𝗸𝗲𝗨𝗻𝗳𝗼𝗹𝗹𝗼𝘄 𝗔𝗺𝗯𝗲𝗿 𝗛𝗲𝗮𝗿𝗱 𝗳𝗿𝗼𝗺 𝗮𝗹𝗹 𝘀𝗼𝗰𝗶𝗮𝗹 𝗺𝗲𝗱𝗶𝗮 𝗟𝗶𝗸𝗲 𝘁𝗵𝗶𝘀 𝗰𝗼𝗺𝗺𝗲𝗻𝘁 𝗮𝗻𝗱 𝘁𝗮𝗸𝗲 𝗶𝘁 𝘁𝗼 𝘁𝗵𝗲 𝘁𝗼𝗽                                                2
гори в аду ембер блядддь                                                                                                                  1
а грин бегущая по во

# Add A Language-Detection Column

https://stackoverflow.com/questions/39142778/python-how-to-determine-the-language

## langdetect
Inaccurate on short comments: the English sample below is detected as Spanish.

In [47]:
from langdetect import detect
detect('no one cares, you suck')

'es'

In [48]:
# def lang_detect(x):
#     if df_merged_final.loc[df_merged_final['clean_text'] == x, 'contains_alpha'].values[0]:
#         language = detect(x)
#     else:
#         language = 'Other'
#     return language

In [49]:
# def det(x):
#     try:
#         lang = detect(x)
#     except:
#         lang = 'Other'
#     return lang

In [50]:
# df_merged_final['language'] = df_merged_final.clean_text.apply(lambda x: det(x)) 

In [51]:
# df_merged_final['language'] = df_merged_final.comment.apply(
#     lambda x: 'none' if len(x)<= 5 else lang_detect(x)) 

In [52]:
# df_merged_final['language'] = df_merged_final.comment.apply(
#     lambda x: 'none' if not df_merged_final.loc[df_merged_final['comment'] == x, 'contains_alpha'].item() else detect(x)) 



In [53]:
# df_merged_final.language.value_counts()

## TextBlob 
Fails with `HTTPError: HTTP Error 429: Too Many Requests`.

In [54]:
# from textblob import TextBlob
# b = TextBlob("bonjour")
# b.detect_language()

In [55]:
# def det(x):
#     try:
#         b = TextBlob(x)
#         lang = b.detect_language()
#     except:
#         lang = 'Other'
#     return lang

In [56]:
# t0 = time.time()
# df_merged_final['language'] = df_merged_final.clean_text.apply(lambda x: det(x)) 
# (time.time() - t0)/60

In [57]:
# df_merged_final.language.value_counts()

## googletrans
Fails with `AttributeError: 'Translator' object has no attribute 'raise_Exception'`.

In [58]:
# from googletrans import Translator
# t = Translator().detect("hello world!")
# t.lang 

In [59]:
# def det(x):
#     try:
#         t = Translator().detect(x)
#         lang = t.lang 
#     except:
#         lang = 'Other'
#     return lang

In [60]:
# t0 = time.time()
# df_merged_final['language'] = df_merged_final.clean_text.apply(lambda x: det(x)) 
# (time.time() - t0)/60

In [61]:
# df_merged_final.language.value_counts()

## langid
Inaccurate on short comments: the English sample below is classified as Italian.

In [62]:
import langid
langid.classify("This is a test")[0]

'en'

In [63]:
langid.classify("no one cares, you suck")[0]

'it'

In [64]:
# def det(x):
#     try:
#         lang = langid.classify(x)[0]
#     except:
#         lang = 'Other'
#     return lang

In [65]:
# t0 = time.time()
# df_merged_final['language'] = df_merged_final.clean_text.apply(lambda x: det(x)) 
# (time.time() - t0)/60

In [66]:
# df_merged_final.language.value_counts()

## chardet
chardet can also report a language when the input contains bytes in the range (127, 255]:

In [67]:
import chardet
chardet.detect("Я люблю вкусные пампушки".encode('cp1251'))['language']

'Russian'

In [68]:
chardet.detect("no one cares, you suck".encode('cp1251'))['language']

''

In [69]:
def det(x):
    # Encoding to cp1251 fails for characters outside that code page (e.g. emojis)
    try:
        lang = chardet.detect(x.encode('cp1251'))['language']
    except Exception:
        lang = 'Other'
    return lang

In [70]:
# t0 = time.time()
# df_merged_final['language'] = df_merged_final.clean_text.apply(lambda x: det(x)) 
# (time.time() - t0)/60

In [71]:
# df_merged_final.language.value_counts()

## guess_language
Intended to detect very short samples by using a spell checker with dictionaries, yet the short English sample below is classified as Italian.

In [72]:
from guess_language import guessLanguage
guessLanguage('ボウリング・フォー・コロンバイン(字幕版)')

'ja'

In [73]:
guessLanguage('no one cares, you suck')

'it'

In [74]:
# def det(x):
#     try:
#         lang = chardet.detect(x.encode('cp1251'))['language']
#     except:
#         lang = 'Other'
#     return lang

In [75]:
# t0 = time.time()
# df_merged_final['language'] = df_merged_final.clean_text.apply(lambda x: det(x)) 
# (time.time() - t0)/60

In [76]:
# df_merged_final.language.value_counts()

## pycld3
pycld3 is a neural network model for language identification. This package contains the inference code and a trained model. <br>
Inaccurate on short comments: the English sample below is classified as Portuguese.

In [77]:
import cld3
c = cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
print(c)
c[0]

LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)


'zh'

In [78]:
cld3.get_language("no one cares, you suck")[0]

'pt'

In [79]:
# cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")[0]

In [80]:
# def det(x):
#     try:
#         lang = cld3.get_language(x)[0]
#     except:
#         lang = 'Other'
#     return lang

In [81]:
# t0 = time.time()
# df_merged_final['language'] = df_merged_final.clean_text.apply(lambda x: det(x)) 
# (time.time() - t0)/60

In [82]:
# df_merged_final.language.value_counts()

## polyglot
Able to detect texts with mixed languages.

- Requires ICU to be installed in the Anaconda environment

In [83]:
from polyglot.detect import Detector

mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state
located in East Asia.
"""
for language in Detector(mixed_text).languages:
    print(language)

name: English     code: en       confidence:  87.0 read bytes:  1154
name: Chinese     code: zh_Hant  confidence:   5.0 read bytes:  1755
name: un          code: un       confidence:   0.0 read bytes:     0


In [84]:
for language in Detector('no one cares, you suck').languages:
    print(language)

name: English     code: en       confidence:  95.0 read bytes:  1349
name: un          code: un       confidence:   0.0 read bytes:     0
name: un          code: un       confidence:   0.0 read bytes:     0


In [85]:
# def det(x):
#     try:
#         lang = cld3.get_language(x)[0]
#     except:
#         lang = 'Other'
#     return lang

In [86]:
# t0 = time.time()
# df_merged_final['language'] = df_merged_final.clean_text.apply(lambda x: det(x)) 
# (time.time() - t0)/60

In [87]:
# df_merged_final.language.value_counts()

# pycld2
If you only need polyglot for language detection, it is better to use pycld2 directly, since that is what polyglot uses behind the scenes. It has a much cleaner API.

https://stackoverflow.com/questions/51503199/how-to-apply-polyglot-detector-function-to-dataframe

In [88]:
import icu
poly_obj = Detector('no one cares, you suck', quiet=True)
icu.Locale.getDisplayName(poly_obj.language.locale)

'English'

In [89]:
def det(x):
    # Return the display name of the detected language, or 'Other' if detection fails
    try:
        poly_obj = Detector(x, quiet=True)
        lang = icu.Locale.getDisplayName(poly_obj.language.locale)
    except Exception:
        lang = 'Other'
    return lang
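Every library tried above needed the same `try`/`except` wrapper, and all of them struggled with very short text. A small factory can hold that logic once and accept any detector function. This is a sketch under assumptions: `make_safe_detector` is a hypothetical helper and the `min_letters` threshold of 3 is arbitrary.

```python
def make_safe_detector(detect_fn, fallback="Other", min_letters=3):
    """Wrap `detect_fn` so it never raises and skips texts with too few letters."""

    def safe_detect(text):
        letters = sum(char.isalpha() for char in text)
        if letters < min_letters:
            return fallback
        try:
            return detect_fn(text)
        except Exception:  # detectors raise library-specific errors on odd input
            return fallback

    return safe_detect
```

Any of the detectors above could be passed as `detect_fn`, for instance a function that returns `icu.Locale.getDisplayName(Detector(x, quiet=True).language.locale)`.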

https://github.com/aboSamoor/polyglot/issues/223

In [90]:
from polyglot.detect.base import logger as polyglot_logger
polyglot_logger.setLevel("ERROR")

In [91]:
t0 = time.time()
df_merged_final['language'] = df_merged_final.clean_text.apply(lambda x: det(x)) 
(time.time() - t0)/60

0.004033565521240234

un --> unknown language

In [92]:
df_merged_final.language.value_counts().head(20)

English        3861
Spanish         549
un              193
Manx             80
Danish           76
Portuguese       41
Indonesian       28
Scots            25
Norwegian        25
German           22
Tagalog          18
Welsh            18
Sinhala          17
Maori            16
Afrikaans        14
Maltese          12
Interlingue      11
Quechua          11
Latin            10
Italian          10
Name: language, dtype: int64

# Add A Translation Column

In [93]:
# from deep_translator import GoogleTranslator

# translated = GoogleTranslator(source='auto', target='en').translate_batch(['ဖေလိုးမ ဖာသယ်မ'])
# translated

https://stackoverflow.com/questions/70673172/how-to-solve-text-must-be-a-valid-text-with-maximum-5000-character-otherwise-it

### Better Approach 

In [94]:
# def translate(x):
#     try:
#         translated = GoogleTranslator(source='auto', target='en').translate_batch(x)
#     except:
#         translated = 'not_translated'
#     return translated

In [95]:
# import time 
# t0 = time.time()

In [96]:
# df_merged_final['translated'] = df_merged_final.comment.apply(
#     lambda x: x if df_merged_final.loc[df_merged_final['comment'] == x,
#                                        'language'].values[0] == 'en' else translate(x)) 

In [97]:
# time.time() - t0

In [98]:
# df_merged_final[df_merged_final.language != 'en'].head()

In [99]:
# print((df_merged_final.translated == 'not_translated').sum())
# df_merged_final[df_merged_final.translated == 'not_translated'].head()

In [100]:
# df_merged_final.comment[4]

In [101]:
# translated = GoogleTranslator(source='auto', target='en').translate_batch(['no one cares, you suck'])
# translated

# Google Translate API 
Paid

https://cloud.google.com/translate/docs/basic/translating-text?hl=it

# googletrans

https://stackoverflow.com/questions/52455774/googletrans-stopped-working-with-error-nonetype-object-has-no-attribute-group

In [102]:
# text = 'This site is awesome'
# from googletrans import Translator
# translator = Translator()
# translator.translate(text , dest ='sw').text

# mtranslate

https://github.com/mouuff/mtranslate

In [103]:
import mtranslate
mtranslate.translate("Bonjour","en","auto")

'Hello'

In [104]:
def translate(x):
    # Translate to English with automatic source-language detection
    try:
        translated = mtranslate.translate(x, "en", "auto")
    except Exception:
        translated = 'not_translated'
    return translated

In [105]:
t0 = time.time()
df_merged_final['translated'] = df_merged_final.clean_text.apply(
    lambda x: x if df_merged_final.loc[df_merged_final['clean_text'] == x,
                                       'language'].values[0] == 'English' else translate(x)) 
(time.time() - t0)/60

9.836498113473256
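The cell above looks up the language by scanning the whole frame for each text and calls the translator once per row, including repeated texts. Translating each distinct non-English text once and mapping the result back is one way to reduce the work. This is a sketch: `translate_non_english` is a hypothetical helper, and `translate_fn` stands for any function such as the `translate` wrapper defined earlier.

```python
import pandas as pd


def translate_non_english(df, translate_fn, text_col="clean_text", lang_col="language"):
    """Return a Series where non-English rows are translated, one call per distinct text."""
    needs_translation = df[lang_col] != "English"
    distinct_texts = df.loc[needs_translation, text_col].unique()
    translations = {text: translate_fn(text) for text in distinct_texts}
    # Keep the original text for English rows, use the mapped translation otherwise.
    return df[text_col].where(~needs_translation, df[text_col].map(translations))
```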

In [106]:
df_merged_final[df_merged_final.language == 'English'].head(60)

Unnamed: 0,username,comment,type,defense_AH,support_AH,offense_AH,defense_against_AH,contains_alpha,clean_text,language,translated
0,Raychel RayRay,skank,post0_comment,,,,,True,skank,English,skank
1,Rich Lawson,not only are you a pretty rubbish actress you ...,post0_comment,,,,,True,not only are you a pretty rubbish actress you ...,English,not only are you a pretty rubbish actress you ...
2,Zeyneb Yusein,"you're a shame and disgrace for your family, i...",post0_comment,,,,,True,"you're a shame and disgrace for your family, i...",English,"you're a shame and disgrace for your family, i..."
3,William Mcneillie,😂 what's next in your game plan? blame arthur ...,post0_comment,,,,,True,what's next in your game plan? blame arthur c...,English,what's next in your game plan? blame arthur c...
4,Flash Garot,"no one cares, you suck",post0_comment,,,,,True,"no one cares, you suck",English,"no one cares, you suck"
5,Joy Bonassoli,i think you should apologize!\r\nyou tried to ...,post0_comment,,,,,True,i think you should apologize! you tried to des...,English,i think you should apologize! you tried to des...
6,Therese Stellato,its time to face the truth! you cant keep pain...,post0_comment,,,,,True,its time to face the truth! you cant keep pain...,English,its time to face the truth! you cant keep pain...
7,Camron Jonson,how did johnny dep lost his position as jack s...,post0_comment,,,,,True,how did johnny dep lost his position as jack s...,English,how did johnny dep lost his position as jack s...
8,Debra Ann Ray,you would show class if you leave johnny alone...,post0_comment,,,,,True,you would show class if you leave johnny alone...,English,you would show class if you leave johnny alone...
9,Cesca Emilia,it's hard enough for an abused person to come ...,post0_comment,,,,,True,it's hard enough for an abused person to come ...,English,it's hard enough for an abused person to come ...


In [107]:
df_merged_final[df_merged_final.language != 'English'].head(60)

Unnamed: 0,username,comment,type,defense_AH,support_AH,offense_AH,defense_against_AH,contains_alpha,clean_text,language,translated
28,Fiorella Guerrero,amber heard hizo añicos dos años de vida de un...,post0_comment,,,,,True,amber heard hizo añicos dos años de vida de un...,Spanish,amber heard shattered two years of the life of...
120,Mishel Judit Quisocala Condori,"aquí desde perú nadie te quiere volver a ver, ...",post0_comment,,,,,True,"aquí desde perú nadie te quiere volver a ver, ...",Spanish,"here from peru no one wants to see you again, ..."
156,Nayd Tania Blanco Quispe,tarde o temprano llegará la hora de pagar todo...,post0_comment,,,,,True,tarde o temprano llegará la hora de pagar todo...,Spanish,Sooner or later the time will come to pay for ...
204,Miriam Núñez Hurtado,que triste que puedas vivir con tu consciencia...,post0_comment,,,,,True,que triste que puedas vivir con tu consciencia...,Spanish,How sad that you can live with your conscience...
206,Ramon Diaz,que impotencia\r\neres una mujer con baja cali...,post0_comment,,,,,True,que impotencia eres una mujer con baja calidad...,Spanish,how helpless you are a woman with low human qu...
208,Luis Oscar Torres Garcia,la señorita heard deberia disculparse publicam...,post0_comment,,,,,True,la señorita heard deberia disculparse publicam...,Spanish,miss heard should publicly apologize and recei...
240,Yulieth Neita Alvarez,amber heard ojala y no te arrepientas de todo ...,post0_comment,,,,,True,amber heard ojala y no te arrepientas de todo ...,Spanish,amber heard hopefully and you don't regret all...
245,Gustavo Brrz,"vuelve a la escuela de actuación, horrible la ...",post0_comment,,,,,True,"vuelve a la escuela de actuación, horrible la ...",Spanish,"go back to acting school, horrible aquama movi..."
260,Abrar Qader,,post0_comment,,,,,True,,Haitian Creole,in
268,Mayra Montesinos,como te sientes ahora que todo el mundo te da ...,post0_comment,,,,,True,como te sientes ahora que todo el mundo te da ...,Spanish,How do you feel now that everyone turns their ...


In [108]:
print((df_merged_final.translated == 'not_translated').sum())
df_merged_final[df_merged_final.translated == 'not_translated'].head()

0


Unnamed: 0,username,comment,type,defense_AH,support_AH,offense_AH,defense_against_AH,contains_alpha,clean_text,language,translated


In [131]:
# df_merged_final.to_csv('./cleaned_data/facebook_cleaned.csv', index = False)

<br>

# More Cleaning

In [15]:
df_comments = pd.read_csv("./cleaned_data/facebook_cleaned.csv")
print(df_comments.shape)
df_comments.head(1)

(5263, 11)


Unnamed: 0,username,comment,type,defense_AH,support_AH,offense_AH,defense_against_AH,contains_alpha,clean_text,language,translated
0,Raychel RayRay,skank,post0_comment,,,,,True,skank,English,skank


In [4]:
pos_text = {'love amber', 'stand with amber', 'standwithamber', 'support amber', 'supportamber', 'justiceforamber', 
            'johnnydeppisawifebeater', 'boycottwomenbeaters', 'wearewithamber','justice for amber','gorgeous amber',
            'istandwithamber','wearewithyouamber', 'amber heard is innocent', 'amber is innocent','support her'}
df_pos = df_comments[df_comments.comment.str.contains('|'.join(pos_text))]
print(df_pos.shape)

(16, 11)
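The phrases are joined into a regular expression, so any phrase containing a regex metacharacter would change the match. Escaping each phrase keeps the match literal, and sorting makes the pattern reproducible even though the source is a set. A sketch with a hypothetical helper name:

```python
import re


def build_phrase_pattern(phrases):
    """Compile one regex that matches any of the phrases literally."""
    # Sort for a reproducible pattern; escape so characters such as '.' stay literal.
    return re.compile("|".join(re.escape(phrase) for phrase in sorted(phrases)))
```

The resulting `pattern.pattern` string can be passed to `str.contains`, together with `na=False` so that missing comments are treated as non-matches.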


In [5]:
df_recommend = df_comments[df_comments.type == 'Recommend']
print(df_recommend.shape)

(42, 11)


In [6]:
remove_users = set(df_pos.username) | set(df_recommend.username) | {'Melvs Grao'}

In [7]:
# exclude all the users with positive comments
df_comments = df_comments[~df_comments.username.isin(remove_users)]

### Fill NaNs with the string 'isnan'

In [8]:
df_comments['clean_text'].isnull().sum()

96

In [9]:
df_comments['clean_text'] = df_comments['clean_text'].fillna('isnan')

In [10]:
df_comments['translated'].isnull().sum()

101

In [11]:
df_comments['translated'] = df_comments['translated'].fillna('isnan')

In [12]:
# df_comments.to_csv('./cleaned_data/facebook_cleaned.csv', index = False)

# Comments Classified offense_AH

In [16]:
df_offense = df_comments[df_comments.offense_AH == True]
print(df_offense.shape)
df_offense.head(1)

(1458, 11)


Unnamed: 0,username,comment,type,defense_AH,support_AH,offense_AH,defense_against_AH,contains_alpha,clean_text,language,translated
216,Kaly Quesada,you are a horrible human. i hope you never get...,post0_comment,False,False,True,False,True,you are a horrible human i hope you never get ...,English,you are a horrible human i hope you never get ...


# Comments Classified defense_against_AH

In [17]:
df_defense_against = df_comments[df_comments.defense_against_AH == True]
print(df_defense_against.shape)
df_defense_against.head(1)

(8, 11)


Unnamed: 0,username,comment,type,defense_AH,support_AH,offense_AH,defense_against_AH,contains_alpha,clean_text,language,translated
2426,Jessica Hoskins,i used to be a fan but after what you did to j...,post1_comment,False,False,True,True,True,i used to be a fan but after what you did to j...,English,i used to be a fan but after what you did to j...


# Profile Reviews that don't recommend

In [18]:
df_not_recommend = df_comments[df_comments.type == 'Not Recommend']
print(df_not_recommend.shape)
df_not_recommend.head(1)

(612, 11)


Unnamed: 0,username,comment,type,defense_AH,support_AH,offense_AH,defense_against_AH,contains_alpha,clean_text,language,translated
4651,Gianluca Carlesso,"horrible person, that's the best description",Not Recommend,False,False,True,False,True,"horrible person, that's the best description",English,"horrible person, that's the best description"


In [None]:
# helpers.isnull(df_comments)

# Deleted Comments

In [19]:
df_deleted = df_comments[df_comments.comment == 'isnan']
df_deleted

Unnamed: 0,username,comment,type,defense_AH,support_AH,offense_AH,defense_against_AH,contains_alpha,clean_text,language,translated
259,Abrar Qader,isnan,post0_comment,,,,,True,isnan,Haitian Creole,in
2395,Dayse Alves,isnan,post1_comment,,,,,True,isnan,Haitian Creole,in
2872,Manfried von Thun,isnan,post2_comment,,,,,True,isnan,Haitian Creole,in
3370,Jodi Line,isnan,post3_comment,,,,,True,isnan,Haitian Creole,in
3658,Arthur Goebbels,isnan,post4_comment,,,,,True,isnan,Haitian Creole,in
4294,Liadan Valeska,isnan,post5_comment,,,,,True,isnan,Haitian Creole,in
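The table shows a side effect of the `'isnan'` placeholder: it went through language detection and translation like real text (detected as Haitian Creole, translated to "in"). A boolean flag lets later steps exclude such rows explicitly. This is a sketch; the column name `is_deleted` and the helper are assumptions.

```python
import pandas as pd


def flag_deleted(df, text_col="comment", sentinel="isnan"):
    """Return a copy with an `is_deleted` column marking empty or sentinel comments."""
    out = df.copy()
    out["is_deleted"] = out[text_col].isna() | (out[text_col] == sentinel)
    return out
```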


In [20]:
# deleted_list = list(df_deleted.username)
# df_comments[df_comments.username.isin(deleted_list)]

In [21]:
## Use Fillna Instead of Dropna
# df_comments.dropna(subset=['comment'], inplace=True)

<br>

<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#end"><b>End of Notebook</b></a></li>  
</ul>

<a id='end'></a>
# END OF NOTEBOOK

<br>