# Data Identification

## Remember. Persistence reveals the path.

For this test we use part of the data from ***A Multilingual, Multi-Style and Multi-Granularity Dataset for Cross-Language Textual Similarity Detection*** by *Jeremy Ferrero, Frederic Agnes, Laurent Besacier, Didier Schwab*.

The documents are grouped by category. There are 4 categories:
- APR. Amazon Product Reviews.
- Conference Papers.
- PAN11. Texts from books freely available on the Project Gutenberg website.
- Wikipedia.

Every category has a separate folder for each language. The provided data will train the prediction models for **text classification** and **topic modeling**, so we must check that every folder contains enough documents in each language and that the documents are correct.
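The check described above can also be done for all folders in one pass. The following is a minimal sketch, assuming the `category/language/*.txt` layout used in this notebook; the helper name `count_documents` is hypothetical and not part of the original code.

```python
from pathlib import Path


def count_documents(root):
    """Return {category: {language: number of .txt files}} for a dataset root.

    Assumes the layout root/<category>/<language>/*.txt.
    """
    counts = {}
    root = Path(root)
    # Each first-level directory is treated as a category.
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        per_language = {}
        # Each second-level directory is treated as a language.
        for language_dir in sorted(p for p in category_dir.iterdir() if p.is_dir()):
            per_language[language_dir.name] = sum(1 for _ in language_dir.glob('*.txt'))
        counts[category_dir.name] = per_language
    return counts
```

Calling `count_documents(...)` on the `documents_challenge` folder before and after cleaning would give both sets of counts without repeating one cell per folder.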

You can find the **cleaning code** in the Jupyter notebook file *02_clean_and_language_detection*.

### 0. Imports
Import the libraries we are going to use in this part.

In [3]:
# Imports
import pandas as pd
import glob, os

# 1. APR

   ### 1.1. English
   
   We count the original files in each folder, then repeat the count after cleaning to determine whether we have enough data for the models.

In [4]:
# Original documents in folder
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/APR/en'
print(len([f for f in glob.glob(path + '/*.txt')]))
#print([f for f in glob.glob(path + '/*.txt')])


3600


In [6]:
# After cleaning
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/APR/en'
print(len([f for f in glob.glob(path + '/*.txt')]))

3585


   ### 1.2. French

In [8]:
# Original documents in folder
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/APR/fr'
print(len([f for f in glob.glob(path + '/*.txt')]))


2400


In [9]:
# After cleaning
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/APR/fr'
print(len([f for f in glob.glob(path + '/*.txt')]))


2374


# 2. Conference_papers

   ### 2.1. English

In [378]:
# Original documents in folder
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/Conference_papers/en'
print(len([f for f in glob.glob(path + '/*.txt')]))

372


In [4]:
# After cleaning
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/Conference_papers/en'
print(len([f for f in glob.glob(path + '/*.txt')]))

372


   ### 2.2. French

In [379]:
# Original documents in folder
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/Conference_papers/fr'
print(len([f for f in glob.glob(path + '/*.txt')]))

248


In [7]:
# After cleaning
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/Conference_papers/fr'
print(len([f for f in glob.glob(path + '/*.txt')]))

248


# 3. PAN11

   ### 3.1. English

In [7]:
# Original documents in folder
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/PAN11/en'
print(len([f for f in glob.glob(path + '/*.txt')]))

1752


In [8]:
# After cleaning
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/PAN11/en'
print(len([f for f in glob.glob(path + '/*.txt')]))

1752


   ### 3.2. Spanish


In [14]:
# Original documents in folder
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/PAN11/es'
print(len([f for f in glob.glob(path + '/*.txt')]))

1168


In [9]:
# After cleaning
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/PAN11/es'
print(len([f for f in glob.glob(path + '/*.txt')]))

1168


# 4. Wikipedia

### 4.1. English

In [12]:
# Original documents in folder
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/Wikipedia/en'
print(len([f for f in glob.glob(path + '/*.txt')]))

4000


In [21]:
# After cleaning
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/Wikipedia/en'
print(len([f for f in glob.glob(path + '/*.txt')]))

3940


### 4.2. Spanish

In [13]:
# Original documents in folder
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/Wikipedia/es'
print(len([f for f in glob.glob(path + '/*.txt')]))

4000


In [19]:
# After cleaning
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/Wikipedia/es'
print(len([f for f in glob.glob(path + '/*.txt')]))

3829


### 4.3. French

In [14]:
# Original documents in folder
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/Wikipedia/fr'
print(len([f for f in glob.glob(path + '/*.txt')]))

5588


In [20]:
# After cleaning
path = '/Users/Natalio/Desktop/nlp_associate_ds_test/NLP_Associate_DS_Test/data/documents_challenge/Wikipedia/fr'
print(len([f for f in glob.glob(path + '/*.txt')]))

5341


Python offers many ways to represent the data and make it easier to understand. Here is an example of how to convert a dictionary into a pandas DataFrame with the results of the data identification.

In [14]:
# Document counts per category (rows) and language (columns), before and after cleaning
dic = {'APR_original': {'English': 3600.0, 'French': 2400.0},
       'APR_clean': {'English': 3585.0, 'French': 2374.0},
       'Conference_papers_original': {'English': 372.0, 'French': 248.0},
       'Conference_papers_clean': {'English': 372.0, 'French': 248.0},
       'PAN11_original': {'English': 1752.0, 'Spanish': 1168.0},
       'PAN11_clean': {'English': 1752.0, 'Spanish': 1168.0},
       'Wikipedia_original': {'English': 4000.0, 'French': 5588.0, 'Spanish': 4000.0},
       'Wikipedia_clean': {'English': 3940.0, 'French': 5341.0, 'Spanish': 3829.0}}
df = pd.DataFrame.from_dict(dic, orient='index')
display(df)

| | English | French | Spanish |
|---|---|---|---|
| APR_original | 3600.0 | 2400.0 | NaN |
| APR_clean | 3585.0 | 2374.0 | NaN |
| Conference_papers_original | 372.0 | 248.0 | NaN |
| Conference_papers_clean | 372.0 | 248.0 | NaN |
| PAN11_original | 1752.0 | NaN | 1168.0 |
| PAN11_clean | 1752.0 | NaN | 1168.0 |
| Wikipedia_original | 4000.0 | 5588.0 | 4000.0 |
| Wikipedia_clean | 3940.0 | 5341.0 | 3829.0 |
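To see how much each folder lost during cleaning, the original and clean counts can be subtracted directly. This is a small sketch that re-enters the counts reported in this notebook; the variable names `original`, `clean`, `removed` and `removed_pct` are chosen here for illustration.

```python
import pandas as pd

nan = float('nan')  # language not present in that category
categories = ['APR', 'Conference_papers', 'PAN11', 'Wikipedia']

# Counts before cleaning, copied from the cells above
original = pd.DataFrame(
    {'English': [3600, 372, 1752, 4000],
     'French': [2400, 248, nan, 5588],
     'Spanish': [nan, nan, 1168, 4000]},
    index=categories)

# Counts after cleaning, copied from the cells above
clean = pd.DataFrame(
    {'English': [3585, 372, 1752, 3940],
     'French': [2374, 248, nan, 5341],
     'Spanish': [nan, nan, 1168, 3829]},
    index=categories)

removed = original - clean                        # documents dropped by cleaning
removed_pct = (removed / original * 100).round(2)  # same, as a percentage

print(removed)
print(removed_pct)
```

Conference_papers and PAN11 lose no documents, while the three Wikipedia folders lose the most in absolute terms.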


# 5. Conclusions

We have enough data to train the models, but the number of files differs widely between categories (from 620 clean documents in Conference_papers to 13110 in Wikipedia), so we need to be careful: the largest category can overshadow the other labels.
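One common way to reduce this effect is to weight each class inversely to its frequency. The sketch below is only an illustration, not the method used in this project: it assumes the four categories are the classification labels and applies the "balanced" heuristic `n_samples / (n_classes * class_count)` to the clean counts reported above.

```python
# Clean document counts per category, summed over languages (from the table above)
clean_counts = {
    'APR': 3585 + 2374,
    'Conference_papers': 372 + 248,
    'PAN11': 1752 + 1168,
    'Wikipedia': 3940 + 5341 + 3829,
}

total = sum(clean_counts.values())
n_classes = len(clean_counts)

# "Balanced" heuristic: rarer categories get proportionally larger weights
class_weights = {
    category: total / (n_classes * count)
    for category, count in clean_counts.items()
}

for category, weight in class_weights.items():
    print(f'{category}: {clean_counts[category]} documents, weight {weight:.2f}')
```

A dictionary like `class_weights` could then be passed to a classifier that accepts per-class weights; resampling the training set is an alternative with a similar goal.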