In [77]:
import pandas as pd
import numpy as np
from tqdm import tqdm

# Cleaning the data

Now that we have seen what our dataset looks like, we have to do a little bit of cleaning. There are three main tasks:
1. Removing two-letter words that make no sense
2. Merging the singular and plural forms of a word
3. Merging words that are similar, i.e. that have the same root

In [2]:
wordCountYear = pd.read_csv('Data/wordCountYear.csv',index_col=0)
wordCountYear.index = pd.to_datetime(wordCountYear.index)

### Removing the two letter words

We want to keep the word "or", so we save its column first. All the other two-letter words that exist in French can be removed (thanks to the Scrabble two-letter word list).

In [3]:
dfOR = wordCountYear['or']

Now we find the columns whose words are one or two letters long.

In [4]:
wordLength = wordCountYear.columns.map(lambda x: len(str(x)))
wordCountYear.columns[wordLength<=2]

Index(['ab', 'ac', 'ad', 'ae', 'af', 'ag', 'ah', 'aj', 'al', 'am',
       ...
       'és', 'ét', 'év', 'îa', 'îe', 'îi', 'îo', 'îr', 'îç', 'îî'],
      dtype='object', length=500)

And now we drop those columns.

In [5]:
wordCountYear = wordCountYear.drop(wordCountYear.columns[wordLength<=2],axis=1)

We dropped 500 columns; now we can add back the "or" column that we want to keep.

In [6]:
wordCountYear['or'] = dfOR
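
The save, drop and restore steps above can also be expressed as a single boolean mask over the columns. The following is only a minimal sketch on a toy frame; `toy`, `keep` and `toy_clean` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Toy stand-in for wordCountYear (hypothetical data)
toy = pd.DataFrame({'ab': [1, 0], 'or': [2, 3], 'a': [1, 1], 'arbre': [4, 5]})

# Keep every column longer than two letters, plus the explicit exception "or"
lengths = toy.columns.map(lambda x: len(str(x)))
keep = (lengths > 2) | (toy.columns == 'or')
toy_clean = toy.loc[:, keep]

print(list(toy_clean.columns))
```

Unlike the drop-and-restore version, this keeps "or" at its original position instead of appending it as the last column.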

### Merging singular and plural words

As we were not satisfied with how NLTK handles the French language, we will implement our own small algorithm to merge similar words together. We start by merging singular and plural words.

In [7]:
#Getting the words
columnNames = wordCountYear.columns.map(lambda x: str(x))

Let's look up a few of the column titles.

In [10]:
columnNames[100:120]

array(['académique', 'académiques', 'accent', 'accepte', 'accepter',
       'acceptions', 'accepté', 'acceptée', 'accessils', 'accessit',
       'accessits', 'accessoires', 'accidens', 'accident', 'accidents',
       'acclamations', 'accompagnemens', 'accompagnement', 'accompagnent',
       'accompagné'], dtype=object)

As we can see, the plural of a word normally comes just after its singular. We therefore do not have to check all of the other words for plurals: looking at the next few columns is enough.

In [103]:
pluralWords = []

for i in range(len(columnNames)):
    wordPair = []
    col = columnNames[i]
    colPlur = col+'s'
    if (i+5) > (len(columnNames) - 1):
        lastEl =  len(columnNames) - 1
    else:
        lastEl = i+5
    for plur in columnNames[i:lastEl]:
        if colPlur == plur:
            wordPair.append(col)
            wordPair.append(plur)
            pluralWords.append(wordPair)

print('Number of Singular-Plural word pairs : ', len(pluralWords))
#jouis appears in two pairs (joui/jouis and jouis/jouiss): we remove the jouis/jouiss pair because jouis will already be dropped when the joui pair is merged
pluralWords.pop(1329)
pluralWords[1:10]

Number of Singular-Plural word pairs :  2684


[['abeille', 'abeilles'],
 ['abolie', 'abolies'],
 ['abondante', 'abondantes'],
 ['abonnement', 'abonnements'],
 ['abord', 'abords'],
 ['abraham', 'abrahams'],
 ['absolu', 'absolus'],
 ['abyssin', 'abyssins'],
 ['acacia', 'acacias']]
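
The windowed search above relies on the columns being sorted alphabetically. As a hedged alternative, the same idea can be sketched with a set lookup, which does not depend on column order. `find_suffix_pairs` and `sample` are hypothetical names introduced for illustration; because this version looks at the whole vocabulary rather than the next few columns, it may return slightly different pairs from the loop above.

```python
def find_suffix_pairs(words, suffix, min_len=1):
    """Return [base, base + suffix] pairs where both forms occur in `words`."""
    vocab = set(words)  # O(1) membership tests
    return [[w, w + suffix] for w in words
            if len(w) >= min_len and (w + suffix) in vocab]

# Small sample taken from the column names shown earlier
sample = ['accent', 'accessit', 'accessits', 'accident', 'accidents']
pairs = find_suffix_pairs(sample, 's')
print(pairs)
```

Chains such as joui / jouis / jouiss would still produce two overlapping pairs here, so they need the same manual handling as above.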

Now that we can see that our algorithm works, we have to merge the corresponding columns of our dataframe.

In [94]:
testMerge = wordCountYear.copy()
for i in tqdm(range(len(pluralWords))):
    wordSum = wordCountYear[pluralWords[i]].sum(axis=1)
    #deleting the columns
    testMerge.drop(pluralWords[i],axis=1,inplace=True)
    testMerge[pluralWords[i][0]] = wordSum

100%|██████████| 2683/2683 [59:12<00:00,  1.37s/it]


In [101]:
testMerge.shape

(2353, 28908)
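
The merge loop above drops and re-adds columns one pair at a time, which is what makes it take about an hour. One possible way to do the same sum in a single pass is to rename each plural column to its singular and then group identically named columns. This is a sketch on toy data, not the code that produced the results in this notebook; `toy`, `pairs` and `to_singular` are illustrative names.

```python
import pandas as pd

# Toy stand-in for wordCountYear and pluralWords (hypothetical data)
toy = pd.DataFrame({'abeille': [1, 2], 'abeilles': [3, 4], 'accent': [5, 6]})
pairs = [['abeille', 'abeilles']]

# Map every plural to its singular; all other columns keep their name
to_singular = {plural: singular for singular, plural in pairs}
renamed = toy.rename(columns=to_singular)

# Columns that now share a name are summed together in one pass
merged = renamed.T.groupby(level=0).sum().T
print(merged)
```

Note that the grouped result has its columns sorted by name, whereas the loop appends each merged column at the end of the frame.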

We can see that our dataset is still very large, but those operations manage to reduce its size without losing information!

We were not able to find a good function to group together words of the same family, so we tried to merge other plurals (those ending with x) and gendered words (masculine and feminine forms). We also noticed that we have to restrict ourselves to words longer than 3 letters, because otherwise we merge too many words that have nothing in common.

In [143]:
cleanCol1 = testMerge.columns.map(lambda x: str(x))

In [137]:
#Here we try to group together words that appear in both the masculine and the feminine form (feminine = masculine + 'e').
family = []
for i in range(len(cleanCol1)):
    col = testMerge.columns.values[i]
    #End of the search window: a few columns after the current one
    lastEl = min(i+5, len(columnNames) - 1)
    colFem = col+'e'
    #The window starts at column 1 (not at i+1), so all earlier columns are scanned as well
    for fem in cleanCol1[1:lastEl]:
        #Only consider words longer than 3 letters, to avoid meaningless merges
        if (colFem == fem) and (len(col) > 3):
            family.append([col, fem])

Let's take a look at a few of the pairs.

In [140]:
family[20:35]

[['augmenté', 'augmentée'],
 ['austro-hongrois', 'austro-hongroise'],
 ['autrich', 'autriche'],
 ['avis', 'avise'],
 ['band', 'bande'],
 ['bard', 'barde'],
 ['bart', 'barte'],
 ['barth', 'barthe'],
 ['benoît', 'benoîte'],
 ['berlinois', 'berlinoise'],
 ['bertaud', 'bertaude'],
 ['black', 'blacke'],
 ['bossu', 'bossue'],
 ['boug', 'bouge'],
 ['bourgeois', 'bourgeoise']]

We can see that for some words we really have the masculine and feminine versions, but for others the shorter form is just a misspelled or truncated word (for example 'autrich' for 'autriche'). For spelling purposes we will therefore keep the feminine version as the name of the merged column.

Now let's look at plurals finishing with x.

In [145]:
#Here we try to group together words with their plural ending in 'x' (plural = singular + 'x').
plurX = []
for i in range(len(cleanCol1)):
    col = testMerge.columns.values[i]
    #End of the search window: a few columns after the current one
    lastEl = min(i+5, len(columnNames) - 1)
    colPlur = col+'x'
    #The window starts at column 1 (not at i+1), so all earlier columns are scanned as well
    for plur in cleanCol1[1:lastEl]:
        #Only consider words longer than 3 letters, to avoid meaningless merges
        if (colPlur == plur) and (len(col) > 3):
            plurX.append([col, plur])
    
plurX[10:25]

[['dieu', 'dieux'],
 ['drapeau', 'drapeaux'],
 ['fceau', 'fceaux'],
 ['fourneau', 'fourneaux'],
 ['genou', 'genoux'],
 ['hameau', 'hameaux'],
 ['journau', 'journaux'],
 ['manteau', 'manteaux'],
 ['marsan', 'marsanx'],
 ['mati', 'matix'],
 ['merck', 'merckx'],
 ['milieu', 'milieux'],
 ['morceau', 'morceaux'],
 ['morne', 'mornex'],
 ['neveu', 'neveux']]
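
Some of the pairs above ('marsan'/'marsanx', 'merck'/'merckx', 'morne'/'mornex') are clearly not singular/plural pairs. One possible refinement, which is not applied in this notebook, is to keep only pairs whose singular ends in -au, -eu or -ou, the endings that typically take an x in the French plural. `PLURAL_X_ENDINGS` and `plausible_x_plural` are hypothetical names, and the list of endings is an assumption made for this sketch.

```python
# Assumption: singular endings that typically form their plural with -x
PLURAL_X_ENDINGS = ('au', 'eu', 'ou')

def plausible_x_plural(pair):
    """True if the pair looks like a regular singular / x-plural pair."""
    singular, plural = pair
    return plural == singular + 'x' and singular.endswith(PLURAL_X_ENDINGS)

# A few pairs taken from the output above
sample_pairs = [['dieu', 'dieux'], ['marsan', 'marsanx'],
                ['genou', 'genoux'], ['merck', 'merckx']]
filtered = [p for p in sample_pairs if plausible_x_plural(p)]
print(filtered)
```

A pair such as 'journau'/'journaux' would still pass this filter, so it reduces the false positives rather than eliminating them.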

Now that we have the words that we can merge together, we will combine the columns of our dataset again!

In [155]:
cleanData1 = testMerge.copy()
for i in tqdm(range(len(family))):
    wordSum = cleanData1[family[i]].sum(axis=1)
    #deleting the columns
    cleanData1.drop(family[i],axis=1,inplace=True)
    cleanData1[family[i][1]] = wordSum

100%|██████████| 553/553 [11:01<00:00,  1.18s/it]


And now we take care of the plural words with x!

In [156]:
cleanData2 = cleanData1.copy()
for i in tqdm(range(len(plurX))):
    wordSum = cleanData2[plurX[i]].sum(axis=1)
    #deleting the columns
    cleanData2.drop(plurX[i],axis=1,inplace=True)
    cleanData2[plurX[i][0]] = wordSum

100%|██████████| 44/44 [00:53<00:00,  1.12s/it]


And finally, we save the data!

In [157]:
cleanData2.to_csv('wordCount_Clean_v1.csv')

### Dealing with missing data

One of the problems with our dataset is that there are some months where the count is missing for some words. If we look at the general trend (for example of "week-end"), we can assume that it is not the word "week-end" that has fallen out of use in the newspapers. It is more likely that we had a problem in the data acquisition. For future prediction and visualization we are more interested in the general shape of the time series than in the few months in between where the word is missing, so it could be interesting to interpolate the missing data.
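
A minimal sketch of such an interpolation, assuming the missing counts are stored as `NaN` and the index is a `DatetimeIndex` (as it is after `pd.to_datetime` above). The toy series uses daily dates only to keep the arithmetic easy to follow; `counts` and `filled` are illustrative names and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy series with gaps (hypothetical values)
dates = pd.date_range('1990-01-01', periods=6, freq='D')
counts = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan],
                   index=dates, name='week-end')

# method='time' weights by the actual time between observations;
# limit_area='inside' only fills gaps surrounded by valid values,
# so leading and trailing NaN are left untouched
filled = counts.interpolate(method='time', limit_area='inside')
print(filled)
```

If missing months were instead recorded as 0, they would first have to be converted to `NaN`, at the risk of also treating genuine zero counts as missing.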