## 1. Reading the CSV

In [24]:
import pandas as pd
import plotly.express as px

In [25]:
df_spam = pd.read_csv("../datas/spam.csv", encoding="iso-8859-1")
df_spam.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


---

## 2. Preprocessing

#### Analysis of the 3 `Unnamed` columns

In [26]:
df_spam.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [27]:
mask_unnamed_2 = (df_spam["Unnamed: 2"].notnull())
df_spam[mask_unnamed_2].head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
95,spam,Your free ringtone is waiting to be collected....,PO Box 5249,"MK17 92H. 450Ppw 16""",
281,ham,\Wen u miss someone,the person is definitely special for u..... B...,why to miss them,"just Keep-in-touch\"" gdeve.."""
444,ham,\HEY HEY WERETHE MONKEESPEOPLE SAY WE MONKEYAR...,HOWU DOIN? FOUNDURSELF A JOBYET SAUSAGE?LOVE ...,,
671,spam,SMS. ac sun0819 posts HELLO:\You seem cool,"wanted to say hi. HI!!!\"" Stop? Send STOP to ...",,
710,ham,Height of Confidence: All the Aeronautics prof...,"this wont even start........ Datz confidence..""",,


In [28]:
mask_unnamed_3 = (df_spam["Unnamed: 3"].notnull())
df_spam[mask_unnamed_3].head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
95,spam,Your free ringtone is waiting to be collected....,PO Box 5249,"MK17 92H. 450Ppw 16""",
281,ham,\Wen u miss someone,the person is definitely special for u..... B...,why to miss them,"just Keep-in-touch\"" gdeve.."""
899,spam,Your free ringtone is waiting to be collected....,PO Box 5249,"MK17 92H. 450Ppw 16""",
1038,ham,"Edison has rightly said, \A fool can ask more ...",GN,GE,"GNT:-)"""
2170,ham,\CAN I PLEASE COME UP NOW IMIN TOWN.DONTMATTER...,JUST REALLYNEED 2DOCD.PLEASE DONTPLEASE DONTIG...,"U NO THECD ISV.IMPORTANT TOME 4 2MORO\""""",


In [29]:
mask_unnamed_4 = (df_spam["Unnamed: 4"].notnull())
df_spam[mask_unnamed_4].head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
281,ham,\Wen u miss someone,the person is definitely special for u..... B...,why to miss them,"just Keep-in-touch\"" gdeve.."""
1038,ham,"Edison has rightly said, \A fool can ask more ...",GN,GE,"GNT:-)"""
2255,ham,I just lov this line: \Hurt me with the truth,I don't mind,i wil tolerat.bcs ur my someone..... But,"Never comfort me with a lie\"" gud ni8 and swe..."
3525,ham,\HEY BABE! FAR 2 SPUN-OUT 2 SPK AT DA MO... DE...,HAD A COOL NYTHO,TX 4 FONIN HON,"CALL 2MWEN IM BK FRMCLOUD 9! J X\"""""
4668,ham,"When I was born, GOD said, \Oh No! Another IDI...",GOD said,"\""OH No! COMPETITION\"". Who knew","one day these two will become FREINDS FOREVER!"""
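
The three masks above can also be combined to see, per row, how many overflow fragments a message has. A minimal sketch on a small made-up frame (the `toy` data and the names `extra_cols`, `n_fragments` and `overflow_rows` are illustrative assumptions, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy frame mimicking the layout of spam.csv (made-up rows).
toy = pd.DataFrame(
    {
        "v2": ["a", "b", "c"],
        "Unnamed: 2": [None, "x", "y"],
        "Unnamed: 3": [None, None, "z"],
    }
)

extra_cols = ["Unnamed: 2", "Unnamed: 3"]

# Number of non-null overflow fragments in each row.
n_fragments = toy[extra_cols].notna().sum(axis=1)

# Rows where at least one overflow column is filled.
overflow_rows = toy[n_fragments > 0]
print(overflow_rows)
```

On the real frame, the same idea would use the three `Unnamed` columns of `df_spam`.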


I will concatenate these columns with the column that contains the message.

In [30]:
df_spam["v2"] = (df_spam["v2"].fillna("") + df_spam["Unnamed: 2"].fillna("") + df_spam["Unnamed: 3"].fillna("") + df_spam["Unnamed: 4"].fillna(""))
to_drop = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"]
df_spam = df_spam.drop(columns=to_drop)
df_spam.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
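
Adding the columns with `+` joins the fragments with nothing in between, so the last word of one fragment runs into the first word of the next. A possible variant, sketched on a made-up frame, joins only the non-null fragments with a separator. The `toy` data and the choice of `", "` as separator are assumptions (it presumes the fragments were originally split on commas, which I have not verified on the raw file):

```python
import pandas as pd

# Hypothetical toy frame mimicking the layout of spam.csv (made-up rows).
toy = pd.DataFrame(
    {
        "v2": ["Wen u miss someone", "Ok lar"],
        "Unnamed: 2": ["the person is special", None],
        "Unnamed: 3": ["why to miss them", None],
    }
)

parts = ["v2", "Unnamed: 2", "Unnamed: 3"]

# For each row, drop the missing fragments and join the rest with ", ".
toy["message"] = toy[parts].apply(
    lambda row: ", ".join(row.dropna().astype(str)), axis=1
)
print(toy["message"].tolist())
```

Rows without overflow fragments are left unchanged, since a single fragment is joined with nothing.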


#### Label and message columns

I rename the columns: `v1` --> `label_text`, `v2` --> `message`

In [31]:
df_spam.columns = ["label_text", "message"]
df_spam.head()

Unnamed: 0,label_text,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


I add a `label` column that encodes the ham / spam label: `ham` --> `0`, `spam` --> `1`

In [32]:
df_spam["label"] = [0 if label == "ham" else 1 for label in df_spam["label_text"]]
df_spam.head()

Unnamed: 0,label_text,message,label
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0
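
The list comprehension above encodes every value other than `ham` as `1`, including any malformed label. A minimal sketch of a stricter alternative using `Series.map` with an explicit dictionary, so that unexpected values show up as `NaN` (the `toy` data, with its deliberately malformed `"SPAM "` entry, is made up for illustration):

```python
import pandas as pd

# Hypothetical toy labels; "SPAM " is a deliberately malformed value.
toy = pd.DataFrame({"label_text": ["ham", "spam", "ham", "SPAM "]})

# Explicit mapping: anything outside the dictionary becomes NaN
# instead of being silently encoded as 1.
mapping = {"ham": 0, "spam": 1}
toy["label"] = toy["label_text"].map(mapping)

# Collect the label values that could not be encoded.
unexpected = toy.loc[toy["label"].isna(), "label_text"].tolist()
print(unexpected)
```

If `unexpected` is empty, the column can be cast to integers with `astype(int)`.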


---

## 3. EDA

Distribution of ham and spam messages

In [33]:
fig = px.histogram(df_spam, x="label_text", color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_layout(title="Ham/spam distribution", yaxis_title="Number of messages", xaxis_title="Message type", title_x=0.5)
fig.show()

### The labels are heavily imbalanced: `spam` is far less represented than `ham`
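
To put numbers on the imbalance rather than reading it off the histogram, `value_counts` can return both the counts and the shares. A minimal sketch on made-up labels (the 8/2 split below is illustrative only and is not the real distribution of the dataset):

```python
import pandas as pd

# Hypothetical toy labels (not the real dataset): 8 ham, 2 spam.
toy = pd.DataFrame({"label_text": ["ham"] * 8 + ["spam"] * 2})

# Absolute counts and relative shares per class.
counts = toy["label_text"].value_counts()
shares = toy["label_text"].value_counts(normalize=True)

summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

On the notebook's frame, the same two calls would be applied to `df_spam["label_text"]`.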

---

Saving the preprocessed file

In [34]:
df_spam.to_csv("../datas/spam_clean.csv", index=False)
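
Since the original file was mis-split on messages containing commas and quotes, it may be worth checking that the cleaned file reads back unchanged. A minimal sketch of such a round-trip check, using an in-memory buffer and made-up rows instead of the real `spam_clean.csv` (the `toy` data and the `buffer` are assumptions for illustration):

```python
import io

import pandas as pd

# Hypothetical toy frame with a message containing a comma and quotes.
toy = pd.DataFrame(
    {
        "label_text": ["ham", "spam"],
        "message": ['Hello, "world"', "Win a prize"],
        "label": [0, 1],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read the CSV back and compare with the original frame.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

same_columns = list(reloaded.columns) == list(toy.columns)
same_messages = reloaded["message"].tolist() == toy["message"].tolist()
print(same_columns, same_messages)
```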