# A Newspaper-Based Measure of Inflation - Python for Data Science Project

### Authors: Lise Marchal, Raphaël Pereira and Raphaël Zambélli--Palacio

This notebook presents the research carried out for the Python for Data Science project course in the second year (2A) at ENSAE.

## Research question:

Do periods of high inflation coincide with periods of increased media coverage of the topic?


In [34]:
import pandas as pd
import nltk
nltk.download("stopwords")
nltk.download("wordnet")
from nltk.corpus import stopwords
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rapam\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rapam\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Text cleaning

In [23]:
df=pd.read_parquet("ArticlesInflation/AllInflation.parquet")

In [12]:
def clean_text(text):
    """Lowercase the text, then strip punctuation and English stop words."""
    stop_words = set(stopwords.words("english"))
    text = text.lower()
    # Remove punctuation character by character, then drop stop words
    text = "".join([char for char in text if char not in string.punctuation])
    words = text.split()
    return " ".join([word for word in words if word not in stop_words])

def lemmatize_text(text):
    """Lemmatize each whitespace-separated word with WordNet."""
    lemmatizer = nltk.WordNetLemmatizer()
    words = text.split()
    return " ".join([lemmatizer.lemmatize(word) for word in words])

def dereference_dico(dico):
    """Extract the article text from the dict stored in the "Article" column."""
    return dico["article"]

Tokenization is implicit here: each whitespace-separated word is a token. This approach can cause problems when the text has an unusual format, such as contractions or URLs. For newspaper articles, and especially old ones, this should rarely be an issue.
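As a rough illustration of this implicit tokenization, the sketch below reproduces only the lowercase, punctuation-removal and whitespace-split steps of `clean_text` (stop-word removal is left out so that no NLTK download is needed). The helper name `strip_punctuation_and_split` and the sample sentences are hypothetical, chosen to show how contractions, hyphenated OCR fragments and URLs collapse into single tokens.

```python
import string


def strip_punctuation_and_split(text):
    """Lowercase, delete every punctuation character, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


# Hypothetical samples: a contraction, a hyphenated OCR fragment, a URL
samples = [
    "Prices don't fall.",
    "VV1ANTED-An experienced man",
    "See https://example.com/news today",
]

for sample in samples:
    print(strip_punctuation_and_split(sample))
```

Because punctuation is deleted rather than replaced by a space, the pieces on either side of a hyphen, apostrophe or slash are fused into one token.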

In [25]:
corpus=df["Article"]
corpus=corpus.apply(dereference_dico)
print(corpus.head())
corpus=corpus.apply(clean_text)
print(corpus.head())

0    BEST MAKES OF guaranteed tires at\n\n\nless th...
1    MAKE YOUR WALL PAPER clean and\n\n\nsweet agai...
2    STORIES. POEMS, PLAYS. ETC., are\n\n\nwanted f...
3    VV1ANTED-An experienced man on punch press and...
4    Have you lost a sum of money? Glasses. Pins an...
Name: Article, dtype: object
0    best makes guaranteed tires less dealers pay f...
1    make wall paper clean sweet simple formula suc...
2    stories poems plays etc wanted publication goo...
3    vv1antedan experienced man punch press eyelet ...
4    lost sum money glasses pins rings found surpri...
Name: Article, dtype: object


In [26]:
corpus=corpus.apply(lemmatize_text)
df["Treated"]=corpus

In [13]:
df["Article"]=df["Article"].apply(dereference_dico)
print(df.head())

                 Title     Date  \
0        The commoner.  1919-01   
1        The commoner.  1919-01   
2        The commoner.  1919-01   
3  New Britain herald.  1919-01   
4  New Britain herald.  1919-01   

                                             Article  
0  BEST MAKES OF guaranteed tires at\n\n\nless th...  
1  MAKE YOUR WALL PAPER clean and\n\n\nsweet agai...  
2  STORIES. POEMS, PLAYS. ETC., are\n\n\nwanted f...  
3  VV1ANTED-An experienced man on punch press and...  
4  Have you lost a sum of money? Glasses. Pins an...  


In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

corpus=df["Treated"]
x = vectorizer.fit_transform(corpus)

In [16]:
print(type(x))
print(x.shape)

<class 'scipy.sparse._csr.csr_matrix'>
(3957020, 13988006)


In [33]:
toLabeled=df.sample(n=10000, random_state=42)
toLabeled=toLabeled[["Title","Date","Article"]]

In [34]:
print(toLabeled.head())

                        Title     Date  \
122716    New Britain herald.  1930-05   
557122      Norwich bulletin.  1920-07   
40954           Evening star.  1960-10   
73882   The Washington times.  1932-03   
131569          Evening star.  1952-10   

                                                  Article  
122716  New York, May ? [Pl-Curb prices huttered irreg...  
557122  number in town have received ti.e handsome. Da...  
40954   TRUSTEES SALE OF VALUABLE\nTWO-STORY BRICK DWE...  
73882   The market value of so repre-,\nsentative stoc...  
131569  the second floor there was displayed "a\ncoron...  


In [36]:
toLabeled.to_csv("ArticlesInflation/ToLabeled.csv")
toLabeledReduit=toLabeled.sample(n=100, random_state=42)
toLabeledReduit.to_csv("ArticlesInflation/ToLabeledReduit.csv")

Selecting articles that are certain to discuss inflation

In [9]:
mots_inflation=["inflation", "disinflation", "inflationary", "deflation", "devaluation","recession","price level", "wage growth", "economic downturn", "monetary policy","inflation rate","interest rates", "price stability", "consumption basket", "purchasing power"]
articlesInflationSur=df[df["Article"].str.contains("|".join(mots_inflation), case=False)] # works because str.contains interprets the pattern as a regex
print(articlesInflationSur.head())
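The keyword filter relies on `"|".join(...)` producing a regular-expression alternation. The sketch below shows the same idea with the standard `re` module on a shortened, hypothetical keyword list and made-up sentences. Wrapping each keyword in `re.escape` is an added precaution, not something the notebook does. It makes no difference for the current keywords, but would matter if one contained a regex metacharacter.

```python
import re

# Hypothetical shortened keyword list, for illustration only
keywords = ["inflation", "price level", "purchasing power"]

# "|".join builds an alternation equivalent to inflation|price level|purchasing power
pattern = re.compile("|".join(re.escape(k) for k in keywords), flags=re.IGNORECASE)

# Made-up sentences, not taken from the corpus
texts = [
    "INFLATION fears grip the market",   # matches regardless of case
    "An inflationary spiral is feared",  # substring match on "inflation"
    "The price\nlevel rose",             # line break inside the phrase: no match
    "Local football results",            # unrelated text
]

matches = [bool(pattern.search(t)) for t in texts]
print(matches)
```

The third sentence is worth noting: the raw articles contain `\n` line breaks, so a multi-word keyword that is split across two lines would be missed by a pattern that expects a plain space.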

In [15]:
print(articlesInflationSur.shape)

(56337, 3)


In [20]:
toLabeled=articlesInflationSur.sample(n=100, random_state=42)
toLabeled.to_csv("ArticlesInflation/toLabelled.csv")

The data were labelled with ChatGPT and then checked by hand

In [41]:
labeled=pd.read_csv("ArticlesInflation/ManuallyLabelledArticles.csv")
print(labeled.head())
print(labeled.shape)

   Unnamed: 0                     Title     Date  \
0       87446    Imperial Valley press.  1944-11   
1      133641  The Daily Alaska empire.  1935-09   
2       24042             Smyrna times.  1955-09   
3       37164             Evening star.  1934-06   
4       29819     The Washington times.  1934-05   

                                             Article     Label  
0  Here Gre 8 big reasons for buying tho\nsanst y...  Positive  
1  Designed to bring the farm program within the\...   Neutral  
2  A federal bearing to determine\nwhether the mi...   Neutral  
3  By the Associated Press.\n\n\nMembership in th...  Positive  
4  1--WE.\n\n\nThese three phases are not un-\nre...  Negative  
(100, 5)


In [42]:
print(labeled["Label"].value_counts())

Label
Neutral     38
Negative    35
Positive    27
Name: count, dtype: int64


In [43]:
texts=labeled["Article"]
labels=labeled["Label"]
texts=texts.apply(clean_text)
texts=texts.apply(lemmatize_text)

In [44]:
def numerical_label(label):
    if label=="Positive":
        return 0
    elif label=="Neutral":
        return 1
    else:
        return 2
labels=labels.apply(numerical_label)

In [46]:
print(labels.value_counts())
print(labels.isna().sum())

Label
1    38
2    35
0    27
Name: count, dtype: int64
0


In [47]:

vectorizer = TfidfVectorizer()
x=vectorizer.fit_transform(texts)

In [48]:
y=labels
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [49]:
from sklearn.svm import SVC
svm_model = SVC()
svm_model.fit(x_train, y_train)

In [53]:
from sklearn.metrics import accuracy_score, classification_report

# Predict on the test set
y_pred = svm_model.predict(x_test)
# Show predictions next to the true labels
print(y_pred)
print(y_test)
# Evaluate the performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


[1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1]
83    1
53    2
70    2
45    1
44    2
39    1
22    1
80    1
10    0
0     0
18    2
30    1
73    2
33    0
90    1
4     2
76    0
77    2
12    2
31    0
Name: Label, dtype: int64
Accuracy: 0.4
Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00         5
           1       0.37      1.00      0.54         7
           2       1.00      0.12      0.22         8

    accuracy                           0.40        20
   macro avg       0.46      0.38      0.25        20
weighted avg       0.53      0.40      0.28        20



