In [58]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import re
%matplotlib inline

# Cleaning the data

Now that we have seen what our datset looked like, we have to a little bit of cleaning. There are three main tasks:
1. Removing two letter words that make no sense
2. Taking care of plural and singular, and merge them
3. Merge words that are similar, i.e having the same root.

In [None]:
wordCountYear = pd.read_csv('Data/3kwordCountMonth.csv',index_col=0)
wordCountYear.index = pd.to_datetime(wordCountYear.index)

### The columns that begin by '-'

Because we wanted to keep words such as week-end, we have a certain number of columns that start with '-' and a word, we will remove the dash.

In [28]:
columnNames = wordCountYear.columns.map(lambda x: x[1:] if str(x).startswith('-') else x)
wordCountYear.columns = columnNames

And the words that end with dash.

In [38]:
otherName = wordCountYear.columns.map(lambda x: x[1:] if str(x).endswith('-') else x)
wordCountYear.columns = otherName

And we can group by the words.

In [45]:
wordCountMonthClean1 =  wordCountYear.groupby(by=wordCountYear.columns,axis=1,level=0).agg(sum)
wordCountMonthClean1.shape

(4351, 77595)

### Dealing with adverbs

We will group words with theit adverb (word and word + ement)

In [53]:
adverbs = wordCountMonthClean1.columns.map(lambda x: x[:-5] if str(x).endswith('ement') else x)
wordCountMonthClean1.columns = adverbs

In [54]:
wordCountMontClean2 = wordCountMonthClean1.groupby(by=wordCountYear.columns,axis=1,level=0).agg(sum)
wordCountMontClean2.shape

(4351, 77458)

### Merging singular and plural words

As we were not satisfied by the NLTK work on the french language, we will implement our own small algorithm to merge similar words together. We start were merging singular and plural words.

In [206]:
#Getting the words
columnNames = wordCountYear.columns.map(lambda x: str(x))

Let's look up a few of the column titles.

In [10]:
columnNames[100:120]

array(['académique', 'académiques', 'accent', 'accepte', 'accepter',
       'acceptions', 'accepté', 'acceptée', 'accessils', 'accessit',
       'accessits', 'accessoires', 'accidens', 'accident', 'accidents',
       'acclamations', 'accompagnemens', 'accompagnement', 'accompagnent',
       'accompagné'], dtype=object)

As we can see, the plural of a word comes normally just after it's singular, we therefore do not have to check all of the other words for plurals.

In [209]:
pluralWords = []

for i in range(len(columnNames)):
    wordPair = []
    col = columnNames[i]
    colPlur = col+'s'
    if (i+5) > (len(columnNames) - 1):
        lastEl =  len(columnNames) - 1
    else:
        lastEl = i+5
    for plur in columnNames[i:lastEl]:
        if colPlur == plur:
            wordPair.append(col)
            wordPair.append(plur)
            pluralWords.append(wordPair)  

print('Number of Singular-Plural word pairs : ', len(pluralWords))
#We have to remove the jouis and jouiss pairs because, jouis will already be removed with the joui pair
pluralWords.pop(1329)
pluralWords[1:10]

Number of Singular-Plural word pairs :  2712


[['abeille', 'abeilles'],
 ['abolie', 'abolies'],
 ['abondante', 'abondantes'],
 ['abonnement', 'abonnements'],
 ['abord', 'abords'],
 ['abraham', 'abrahams'],
 ['absolu', 'absolus'],
 ['abyssin', 'abyssins'],
 ['acacia', 'acacias']]

Now that we can see that our algorithm work we have to merge together our dataframe.

In [94]:
testMerge = wordCountYear.copy()
for i in tqdm(range(len(pluralWords))):
    wordSum = wordCountYear[pluralWords[i]].sum(axis=1)
    #deleting the columns
    testMerge.drop(pluralWords[i],axis=1,inplace=True)
    testMerge[pluralWords[i][0]] = wordSum

100%|██████████| 2683/2683 [59:12<00:00,  1.37s/it]


In [101]:
testMerge.shape

(2353, 28908)

We can see that our dataset is still very large, but those operations manage to reduce its size without loosing information!

We were not able to find a good function to group together words of the same group, so we tried to merge together other plurals (with x at the end, and gender words that we can put together. We also remark that we have to look at words that are longer than 3 letters because otherwise we merge too many words that make no sense.

In [203]:
cleanCol1 = testMerge.columns.map(lambda x: str(x))

In [211]:
#Here we try to group together words that are in the text in the masculine and feminine version.
family = []
for i in range(len(cleanCol1)):
    wordFamily = []
    col = testMerge.columns.values[i]
    if (i+5) > (len(columnNames) - 1):
        lastEl =  len(columnNames) - 1
    else:
        lastEl = i+5
    i=0
    colPlur = col+'e'
    for plur in cleanCol1[i+1:lastEl]:
        if (colPlur == plur) & (len(col) > 3):
            wordFamily.append(col)
            wordFamily.append(plur)
            family.append(wordFamily)

noël noële


If we take a look at a few of the data.

In [140]:
family[20:35]

[['augmenté', 'augmentée'],
 ['austro-hongrois', 'austro-hongroise'],
 ['autrich', 'autriche'],
 ['avis', 'avise'],
 ['band', 'bande'],
 ['bard', 'barde'],
 ['bart', 'barte'],
 ['barth', 'barthe'],
 ['benoît', 'benoîte'],
 ['berlinois', 'berlinoise'],
 ['bertaud', 'bertaude'],
 ['black', 'blacke'],
 ['bossu', 'bossue'],
 ['boug', 'bouge'],
 ['bourgeois', 'bourgeoise']]

We can see that for some words we have the masculine and feminine version, but other it is just the word that has a spelling mistake, so for spelling purposes we will take the feminine version for the grouping and the final word.

Now let's look at plurals finishing with x.

In [212]:
#Here we try to group together words that are in the text in the masculine and feminine version.
plurX = []
for i in range(len(cleanCol1)):
    pluxXfamily = []
    col = testMerge.columns.values[i]
    if (i+5) > (len(columnNames) - 1):
        lastEl =  len(columnNames) - 1
    else:
        lastEl = i+5
    i=0
    colPlur = col+'x'
    for plur in cleanCol1[i+1:lastEl]:
        if (colPlur == plur) & (len(col) > 3):
            pluxXfamily.append(col)
            pluxXfamily.append(plur)
            plurX.append(pluxXfamily)
            if col == 'noël':
                print(col,plur)
    
plurX[10:25]

[['drapeau', 'drapeaux'],
 ['fceau', 'fceaux'],
 ['fourneau', 'fourneaux'],
 ['genou', 'genoux'],
 ['hameau', 'hameaux'],
 ['journau', 'journaux'],
 ['manteau', 'manteaux'],
 ['marsan', 'marsanx'],
 ['merck', 'merckx'],
 ['milieu', 'milieux'],
 ['morceau', 'morceaux'],
 ['morne', 'mornex'],
 ['neveu', 'neveux'],
 ['nouveau', 'nouveaux'],
 ['peau', 'peaux']]

Now that we have words that we can merge together , we will work to combine our dataset again!

In [155]:
cleanData1 = testMerge.copy()
for i in tqdm(range(len(family))):
    wordSum = cleanData1[family[i]].sum(axis=1)
    #deleting the columns
    cleanData1.drop(family[i],axis=1,inplace=True)
    cleanData1[family[i][1]] = wordSum

100%|██████████| 553/553 [11:01<00:00,  1.18s/it]


And now we take care of the plural words with x!

In [156]:
cleanData2 = cleanData1.copy()
for i in tqdm(range(len(plurX))):
    wordSum = cleanData2[plurX[i]].sum(axis=1)
    #deleting the columns
    cleanData2.drop(plurX[i],axis=1,inplace=True)
    cleanData2[plurX[i][0]] = wordSum

100%|██████████| 44/44 [00:53<00:00,  1.12s/it]


And finally, we save the data!

In [157]:
cleanData2.to_csv('wordCount_Clean_v1.csv')

In [2]:
#reimporting to continue working with data
cleanData2 = pd.read_csv('wordCount_Clean_v1.csv',index_col=0)

### Taking care of apostrophes
One issue that we did not take care during the extraction of the data is that we did not remove apostrophe from the dataset (which we should have). We therefore want to remove them now to group together the word that was alone with the word with an apostrophe, for example l'état and état should be counted together.

In [125]:
#Finding which column titles contain an apostrophe
isApostrophe = cleanData2.columns.astype(str).map(lambda x: len(x.split("'")))
separation = cleanData2.columns.astype(str).map(lambda x: x.split("'"))

#we need a dict for the new columns
newName = dict(zip(cleanData2.columns[isApostrophe==2],
                   [e[1] for e in separation[isApostrophe==2]]))

cleanData2.rename(columns = newName,inplace=True)

And now we can groupBy.

In [129]:
cleanData2 = cleanData2.groupby(by=cleanData2.columns,axis=1,level=0).agg(sum)

We have not yet taken care of vers (that finish with er).

In [197]:
colNamesClean2 = cleanData2.columns.values
verb = []
infinitif = []
masc = []
fem = []
for i in range(len(colNamesClean2)):
    col = colNamesClean2[i]
    if (i+5) > (len(colNamesClean2) - 1):
        lastEl =  len(colNamesClean2) - 1
    else:
        lastEl = i+5
    
    colPlur = col+'r'
    colPlurE = col+'e'
    for plur in colNamesClean2[i+1:lastEl]:
        if (colPlur == plur):
            verb.append(col)
            infinitif.append(plur)

        if(colPlurE == plur):
            masc.append(col)
            fem.append(plur)
        

dictMF = dict(zip(fem,masc))
dictVerb = dict(zip(verb,infinitif))

And we use this new dict to map new columns, just as before!

In [146]:
cleanData2.rename(columns = dictVerb,inplace=True)
cleanData2.rename(columns = dictMF,inplace=True)
cleanData2 = cleanData2.groupby(by=cleanData2.columns,axis=1,level=0).agg(sum)

In [151]:
cleanData2.shape

(2353, 26481)

In [148]:
cleanData2.to_csv('wordCount_Clean_v2.csv')

We see that we have already been able to reduce the number of words by nearly 1/6 without loosing word counts.

### Dealing with missing data

One of the problem that we have for our dataset is that there are some months were a word count is missing (for some words). If we look at the general trend (for example week-end), we can assume that it is not week-end that has fallen out of interest in the journals. It's possible that we had a problem in the data acquisition. For furutre predition and visualization we are more interested in the general shape of the time series, and not the few months in between where the word is missing, therefore it could be interesting to interpolate the missing data.

Before interpolating we can start removing words that were present only in 3 month or less and nowhere else.

In [152]:
cleanData3 = cleanData2.copy()
numMonth = cleanData3.astype(bool).sum(axis=0).values

In [166]:
cleanData3.drop(cleanData3.columns[numMonth < 4].values,axis=1,inplace=True)

Now we look at how many times each word is present, and we remove the words that are only present less than 30 times. As words follow a long tail distribution, most of them will not appear very often it's hard to remove words without being scared to loose too much information.

In [216]:
totNumWords = cleanData3.sum(axis=0).values
cleanData3.drop(cleanData3.columns[totNumWords<25].values,axis=1,inplace=True)
cleanData3.rename(columns = {'noële' : 'noël'},inplace=True)

In [218]:
cleanData3.to_csv('wordCountYear_Clean_v3.csv')