# Import Data

## Overview
When the MongoDB database needs to be updated with new data, it can be convenient to delete all documents (Tweets) from the existing collection and insert the new documents from the JSON files in the ``data`` folder at the project root. This keeps the data consistent and ensures that no modified data is left behind. This script can be used for that purpose.

What does it do?
1. Deletes the existing data from the MongoDB database named Tweet (collection ``dataset``)
2. Reads the data in the ``data`` folder at the project root. Note that this data was not committed because of its large size. It can be downloaded from the following link: [Click here](https://www.dropbox.com/s/qfhaobip55xxkif/Tweet%20Worldcup.zip?dl=0)
3. Inserts the data that was read into the database
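
The three steps above can be sketched as a single function. This is a minimal sketch rather than the notebook's actual code: `refresh_collection` is a hypothetical helper, it assumes each file holds one JSON document per line, and it assumes `collection` exposes `delete_many` and `insert_many` the way a PyMongo collection does.

```python
import json
from pathlib import Path


def refresh_collection(collection, data_dir):
    """Empty `collection`, then reload it from the *.json files in `data_dir`.

    Returns the number of documents inserted.
    """
    # Step 1: remove every existing document (an empty filter matches all).
    collection.delete_many({})

    inserted = 0
    # Step 2: read each JSON file (sorted for a reproducible order).
    for json_path in sorted(Path(data_dir).glob("*.json")):
        with open(json_path, encoding="utf-8") as f:
            # Assumption: one JSON document per non-empty line.
            docs = [json.loads(line) for line in f if line.strip()]
        # Step 3: insert_many rejects an empty list, so skip empty files.
        if docs:
            collection.insert_many(docs)
            inserted += len(docs)
    return inserted
```

Given a PyMongo collection such as `client["Tweet"]["dataset"]`, a call would look like `refresh_collection(collection, "../../data")`.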

## The code
We import the required libraries.

In [81]:
import pymongo
import json
import os
from datetime import datetime
import numpy as np

Connecting to the MongoDB database

In [79]:
client = pymongo.MongoClient("mongodb://mongo:mongo@localhost:27017")
db = client["Tweet"]
collection = db["dataset"]

Deleting all documents from the collection

In [48]:
collection.delete_many({})

<pymongo.results.DeleteResult at 0x205e2d200d0>
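
The cell output above only shows the object's repr. `delete_many` returns a `DeleteResult` whose `deleted_count` attribute gives the number of documents removed. A small sketch of how that count could be reported, where `clear_collection` is a hypothetical helper name:

```python
def clear_collection(collection):
    """Delete every document and return how many were removed."""
    # An empty filter {} matches all documents in the collection.
    result = collection.delete_many({})
    # DeleteResult.deleted_count is the number of documents deleted.
    return result.deleted_count
```

Printing the returned value gives a quick sanity check that the collection was emptied before the import starts.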

Defining the function that uploads to MongoDB all the documents contained in a file

In [82]:
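def upload_file_data(path):
    """Insert into MongoDB every document of a file holding one JSON document per line."""
    with open(path, encoding='UTF-8') as f:
        # A plain list avoids np.append, which copies the whole array on each call
        docs = [json.loads(line) for line in f if line.strip()]
    # insert_many raises on an empty list, so skip empty files
    if docs:
        collection.insert_many(docs)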
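
For large files, holding every document in memory before a single `insert_many` call may be costly. One possible variant, sketched here under the assumption that each line holds one JSON document, inserts fixed-size batches instead. The names `iter_json_batches` and `upload_in_batches` and the default `batch_size` of 1000 are illustrative choices, not part of the original notebook.

```python
import json


def iter_json_batches(path, batch_size=1000):
    """Yield lists of at most `batch_size` documents parsed from a JSON Lines file."""
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines, which json.loads would reject
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # remaining documents that did not fill a whole batch


def upload_in_batches(collection, path, batch_size=1000):
    """Insert the file's documents batch by batch; return the total inserted."""
    total = 0
    for batch in iter_json_batches(path, batch_size):
        collection.insert_many(batch)
        total += len(batch)
    return total
```

Because `collection` is only required to provide `insert_many`, the function can be exercised with a stub object before pointing it at a real database.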

Retrieving the names of the JSON files in the appropriate folder

In [76]:
path = "../../data"
all_files = os.listdir(path)
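
`os.listdir` returns every entry in the folder, in arbitrary order, including non-JSON files and subdirectories. A sketch of one way to keep only the JSON files up front, so that any count or percentage refers to files that are actually uploaded (`list_json_files` is a hypothetical helper, not part of the original notebook):

```python
import os


def list_json_files(folder):
    """Return the sorted names of the regular files in `folder` ending in .json."""
    return sorted(
        name
        for name in os.listdir(folder)
        if name.endswith(".json") and os.path.isfile(os.path.join(folder, name))
    )
```

Sorting the names also makes the upload order reproducible from one run to the next.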

We loop over all the files to upload the documents they contain, printing statistics to track the progress of the import into MongoDB

In [78]:
start_time = datetime.now()
print(f">>>> Start : {start_time}")

# begin upload
nb_files = len(all_files)
for idx_files, file in enumerate(all_files):
    start_file_time = datetime.now()
    # only JSON files are uploaded; other entries are still counted and timed
    if file.endswith(".json"):
        upload_file_data(os.path.join(path, file))
    # whole seconds elapsed for this file
    time_for_this_file = int((datetime.now() - start_file_time).total_seconds())
    percentage = round(idx_files * 100 / nb_files, 2)
    print(f" - Upload files n°{idx_files}/{nb_files} in {time_for_this_file}s ({percentage}%)")

# some stats
end_time = datetime.now()
print(f">>>> End : {end_time}")
# total_seconds() also accounts for runs longer than a day, unlike .seconds
total_seconds = int((end_time - start_time).total_seconds())
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"Upload done in {hours}h {minutes}min {seconds}sec")

>>>> Start : 2023-04-23 21:02:51.453995
 - Upload files n°0/2286 in 1s (0.0%)
 - Upload files n°1/2286 in 1s (0.04%)
 - Upload files n°2/2286 in 1s (0.09%)
 - Upload files n°3/2286 in 1s (0.13%)
 - Upload files n°4/2286 in 1s (0.17%)
 - Upload files n°5/2286 in 1s (0.22%)
 - Upload files n°6/2286 in 1s (0.26%)
 - Upload files n°7/2286 in 1s (0.31%)
 - Upload files n°8/2286 in 1s (0.35%)
 - Upload files n°9/2286 in 1s (0.39%)
 - Upload files n°10/2286 in 1s (0.44%)
 - Upload files n°11/2286 in 1s (0.48%)
 - Upload files n°12/2286 in 1s (0.52%)
 - Upload files n°13/2286 in 1s (0.57%)
 - Upload files n°14/2286 in 1s (0.61%)
 - Upload files n°15/2286 in 1s (0.66%)
 - Upload files n°16/2286 in 1s (0.7%)
 - Upload files n°17/2286 in 1s (0.74%)
 - Upload files n°18/2286 in 1s (0.79%)
 - Upload files n°19/2286 in 1s (0.83%)
 - Upload files n°20/2286 in 1s (0.87%)
 - Upload files n°21/2286 in 1s (0.92%)
 - Upload files n°22/2286 in 1s (0.96%)
 - Upload files n°23/2286 in 1s (1.01%)
 [...]

 - Upload files n°202/2286 in 1s (8.84%)
 - Upload files n°203/2286 in 1s (8.88%)
 - Upload files n°204/2286 in 1s (8.92%)
 - Upload files n°205/2286 in 1s (8.97%)
 - Upload files n°206/2286 in 1s (9.01%)
 - Upload files n°207/2286 in 1s (9.06%)
 - Upload files n°208/2286 in 1s (9.1%)
 - Upload files n°209/2286 in 1s (9.14%)
 - Upload files n°210/2286 in 1s (9.19%)
 - Upload files n°211/2286 in 1s (9.23%)
 - Upload files n°212/2286 in 1s (9.27%)
 - Upload files n°213/2286 in 1s (9.32%)
 - Upload files n°214/2286 in 1s (9.36%)
 - Upload files n°215/2286 in 1s (9.41%)
 - Upload files n°216/2286 in 1s (9.45%)
 - Upload files n°217/2286 in 1s (9.49%)
 - Upload files n°218/2286 in 1s (9.54%)
 - Upload files n°219/2286 in 1s (9.58%)
 - Upload files n°220/2286 in 1s (9.62%)
 - Upload files n°221/2286 in 1s (9.67%)
 - Upload files n°222/2286 in 1s (9.71%)
 - Upload files n°223/2286 in 1s (9.76%)
 - Upload files n°224/2286 in 1s (9.8%)
 - Upload files n°225/2286 in 1s (9.84%)
 [...]

 - Upload files n°399/2286 in 1s (17.45%)
 - Upload files n°400/2286 in 1s (17.5%)
 - Upload files n°401/2286 in 1s (17.54%)
 - Upload files n°402/2286 in 1s (17.59%)
 - Upload files n°403/2286 in 1s (17.63%)
 - Upload files n°404/2286 in 1s (17.67%)
 - Upload files n°405/2286 in 0s (17.72%)
 - Upload files n°406/2286 in 1s (17.76%)
 - Upload files n°407/2286 in 1s (17.8%)
 - Upload files n°408/2286 in 1s (17.85%)
 - Upload files n°409/2286 in 1s (17.89%)
 - Upload files n°410/2286 in 1s (17.94%)
 - Upload files n°411/2286 in 1s (17.98%)
 - Upload files n°412/2286 in 1s (18.02%)
 - Upload files n°413/2286 in 1s (18.07%)
 - Upload files n°414/2286 in 1s (18.11%)
 - Upload files n°415/2286 in 1s (18.15%)
 - Upload files n°416/2286 in 1s (18.2%)
 - Upload files n°417/2286 in 1s (18.24%)
 - Upload files n°418/2286 in 1s (18.29%)
 - Upload files n°419/2286 in 1s (18.33%)
 - Upload files n°420/2286 in 1s (18.37%)
 - Upload files n°421/2286 in 1s (18.42%)
 [...]

 - Upload files n°595/2286 in 1s (26.03%)
 - Upload files n°596/2286 in 1s (26.07%)
 - Upload files n°597/2286 in 1s (26.12%)
 - Upload files n°598/2286 in 1s (26.16%)
 - Upload files n°599/2286 in 1s (26.2%)
 - Upload files n°600/2286 in 1s (26.25%)
 - Upload files n°601/2286 in 1s (26.29%)
 - Upload files n°602/2286 in 1s (26.33%)
 - Upload files n°603/2286 in 1s (26.38%)
 - Upload files n°604/2286 in 1s (26.42%)
 - Upload files n°605/2286 in 1s (26.47%)
 - Upload files n°606/2286 in 1s (26.51%)
 - Upload files n°607/2286 in 1s (26.55%)
 - Upload files n°608/2286 in 1s (26.6%)
 - Upload files n°609/2286 in 1s (26.64%)
 - Upload files n°610/2286 in 1s (26.68%)
 - Upload files n°611/2286 in 1s (26.73%)
 - Upload files n°612/2286 in 1s (26.77%)
 - Upload files n°613/2286 in 1s (26.82%)
 - Upload files n°614/2286 in 1s (26.86%)
 - Upload files n°615/2286 in 1s (26.9%)
 - Upload files n°616/2286 in 1s (26.95%)
 - Upload files n°617/2286 in 1s (26.99%)
 [...]

 - Upload files n°791/2286 in 1s (34.6%)
 - Upload files n°792/2286 in 1s (34.65%)
 - Upload files n°793/2286 in 1s (34.69%)
 - Upload files n°794/2286 in 1s (34.73%)
 - Upload files n°795/2286 in 1s (34.78%)
 - Upload files n°796/2286 in 1s (34.82%)
 - Upload files n°797/2286 in 1s (34.86%)
 - Upload files n°798/2286 in 1s (34.91%)
 - Upload files n°799/2286 in 1s (34.95%)
 - Upload files n°800/2286 in 1s (35.0%)
 - Upload files n°801/2286 in 1s (35.04%)
 - Upload files n°802/2286 in 1s (35.08%)
 - Upload files n°803/2286 in 1s (35.13%)
 - Upload files n°804/2286 in 1s (35.17%)
 - Upload files n°805/2286 in 1s (35.21%)
 - Upload files n°806/2286 in 1s (35.26%)
 - Upload files n°807/2286 in 1s (35.3%)
 - Upload files n°808/2286 in 1s (35.35%)
 - Upload files n°809/2286 in 1s (35.39%)
 - Upload files n°810/2286 in 1s (35.43%)
 - Upload files n°811/2286 in 1s (35.48%)
 - Upload files n°812/2286 in 1s (35.52%)
 - Upload files n°813/2286 in 1s (35.56%)
 [...]

 - Upload files n°987/2286 in 1s (43.18%)
 - Upload files n°988/2286 in 1s (43.22%)
 - Upload files n°989/2286 in 1s (43.26%)
 - Upload files n°990/2286 in 1s (43.31%)
 - Upload files n°991/2286 in 1s (43.35%)
 - Upload files n°992/2286 in 1s (43.39%)
 - Upload files n°993/2286 in 1s (43.44%)
 - Upload files n°994/2286 in 1s (43.48%)
 - Upload files n°995/2286 in 1s (43.53%)
 - Upload files n°996/2286 in 1s (43.57%)
 - Upload files n°997/2286 in 1s (43.61%)
 - Upload files n°998/2286 in 1s (43.66%)
 - Upload files n°999/2286 in 1s (43.7%)
 - Upload files n°1000/2286 in 1s (43.74%)
 - Upload files n°1001/2286 in 1s (43.79%)
 - Upload files n°1002/2286 in 1s (43.83%)
 - Upload files n°1003/2286 in 1s (43.88%)
 - Upload files n°1004/2286 in 1s (43.92%)
 - Upload files n°1005/2286 in 1s (43.96%)
 - Upload files n°1006/2286 in 1s (44.01%)
 - Upload files n°1007/2286 in 1s (44.05%)
 - Upload files n°1008/2286 in 1s (44.09%)
 - Upload files n°1009/2286 in 1s (44.14%)
 [...]

 - Upload files n°1179/2286 in 1s (51.57%)
 - Upload files n°1180/2286 in 1s (51.62%)
 - Upload files n°1181/2286 in 1s (51.66%)
 - Upload files n°1182/2286 in 1s (51.71%)
 - Upload files n°1183/2286 in 0s (51.75%)
 - Upload files n°1184/2286 in 1s (51.79%)
 - Upload files n°1185/2286 in 1s (51.84%)
 - Upload files n°1186/2286 in 1s (51.88%)
 - Upload files n°1187/2286 in 1s (51.92%)
 - Upload files n°1188/2286 in 1s (51.97%)
 - Upload files n°1189/2286 in 1s (52.01%)
 - Upload files n°1190/2286 in 1s (52.06%)
 - Upload files n°1191/2286 in 1s (52.1%)
 - Upload files n°1192/2286 in 1s (52.14%)
 - Upload files n°1193/2286 in 1s (52.19%)
 - Upload files n°1194/2286 in 1s (52.23%)
 - Upload files n°1195/2286 in 1s (52.27%)
 - Upload files n°1196/2286 in 1s (52.32%)
 - Upload files n°1197/2286 in 0s (52.36%)
 - Upload files n°1198/2286 in 1s (52.41%)
 - Upload files n°1199/2286 in 1s (52.45%)
 - Upload files n°1200/2286 in 1s (52.49%)
 - Upload files n°1201/2286 in 1s (52.54%)
 [...]

 - Upload files n°1370/2286 in 1s (59.93%)
 - Upload files n°1371/2286 in 1s (59.97%)
 - Upload files n°1372/2286 in 1s (60.02%)
 - Upload files n°1373/2286 in 1s (60.06%)
 - Upload files n°1374/2286 in 1s (60.1%)
 - Upload files n°1375/2286 in 1s (60.15%)
 - Upload files n°1376/2286 in 1s (60.19%)
 - Upload files n°1377/2286 in 1s (60.24%)
 - Upload files n°1378/2286 in 1s (60.28%)
 - Upload files n°1379/2286 in 1s (60.32%)
 - Upload files n°1380/2286 in 1s (60.37%)
 - Upload files n°1381/2286 in 1s (60.41%)
 - Upload files n°1382/2286 in 1s (60.45%)
 - Upload files n°1383/2286 in 1s (60.5%)
 - Upload files n°1384/2286 in 1s (60.54%)
 - Upload files n°1385/2286 in 1s (60.59%)
 - Upload files n°1386/2286 in 2s (60.63%)
 - Upload files n°1387/2286 in 1s (60.67%)
 - Upload files n°1388/2286 in 1s (60.72%)
 - Upload files n°1389/2286 in 1s (60.76%)
 - Upload files n°1390/2286 in 1s (60.8%)
 - Upload files n°1391/2286 in 1s (60.85%)
 - Upload files n°1392/2286 in 1s (60.89%)
 [...]

 - Upload files n°1562/2286 in 1s (68.33%)
 - Upload files n°1563/2286 in 1s (68.37%)
 - Upload files n°1564/2286 in 1s (68.42%)
 - Upload files n°1565/2286 in 1s (68.46%)
 - Upload files n°1566/2286 in 1s (68.5%)
 - Upload files n°1567/2286 in 1s (68.55%)
 - Upload files n°1568/2286 in 1s (68.59%)
 - Upload files n°1569/2286 in 1s (68.64%)
 - Upload files n°1570/2286 in 1s (68.68%)
 - Upload files n°1571/2286 in 1s (68.72%)
 - Upload files n°1572/2286 in 1s (68.77%)
 - Upload files n°1573/2286 in 1s (68.81%)
 - Upload files n°1574/2286 in 1s (68.85%)
 - Upload files n°1575/2286 in 1s (68.9%)
 - Upload files n°1576/2286 in 1s (68.94%)
 - Upload files n°1577/2286 in 1s (68.99%)
 - Upload files n°1578/2286 in 1s (69.03%)
 - Upload files n°1579/2286 in 1s (69.07%)
 - Upload files n°1580/2286 in 1s (69.12%)
 - Upload files n°1581/2286 in 1s (69.16%)
 - Upload files n°1582/2286 in 1s (69.2%)
 - Upload files n°1583/2286 in 1s (69.25%)
 - Upload files n°1584/2286 in 1s (69.29%)
 [...]

 - Upload files n°1754/2286 in 1s (76.73%)
 - Upload files n°1755/2286 in 1s (76.77%)
 - Upload files n°1756/2286 in 1s (76.82%)
 - Upload files n°1757/2286 in 1s (76.86%)
 - Upload files n°1758/2286 in 1s (76.9%)
 - Upload files n°1759/2286 in 1s (76.95%)
 - Upload files n°1760/2286 in 1s (76.99%)
 - Upload files n°1761/2286 in 1s (77.03%)
 - Upload files n°1762/2286 in 1s (77.08%)
 - Upload files n°1763/2286 in 1s (77.12%)
 - Upload files n°1764/2286 in 1s (77.17%)
 - Upload files n°1765/2286 in 1s (77.21%)
 - Upload files n°1766/2286 in 1s (77.25%)
 - Upload files n°1767/2286 in 1s (77.3%)
 - Upload files n°1768/2286 in 1s (77.34%)
 - Upload files n°1769/2286 in 0s (77.38%)
 - Upload files n°1770/2286 in 1s (77.43%)
 - Upload files n°1771/2286 in 1s (77.47%)
 - Upload files n°1772/2286 in 1s (77.52%)
 - Upload files n°1773/2286 in 1s (77.56%)
 - Upload files n°1774/2286 in 1s (77.6%)
 - Upload files n°1775/2286 in 0s (77.65%)
 - Upload files n°1776/2286 in 1s (77.69%)
 [...]

 - Upload files n°1946/2286 in 1s (85.13%)
 - Upload files n°1947/2286 in 1s (85.17%)
 - Upload files n°1948/2286 in 1s (85.21%)
 - Upload files n°1949/2286 in 1s (85.26%)
 - Upload files n°1950/2286 in 1s (85.3%)
 - Upload files n°1951/2286 in 1s (85.35%)
 - Upload files n°1952/2286 in 1s (85.39%)
 - Upload files n°1953/2286 in 1s (85.43%)
 - Upload files n°1954/2286 in 1s (85.48%)
 - Upload files n°1955/2286 in 1s (85.52%)
 - Upload files n°1956/2286 in 1s (85.56%)
 - Upload files n°1957/2286 in 1s (85.61%)
 - Upload files n°1958/2286 in 1s (85.65%)
 - Upload files n°1959/2286 in 1s (85.7%)
 - Upload files n°1960/2286 in 1s (85.74%)
 - Upload files n°1961/2286 in 1s (85.78%)
 - Upload files n°1962/2286 in 1s (85.83%)
 - Upload files n°1963/2286 in 1s (85.87%)
 - Upload files n°1964/2286 in 1s (85.91%)
 - Upload files n°1965/2286 in 1s (85.96%)
 - Upload files n°1966/2286 in 1s (86.0%)
 - Upload files n°1967/2286 in 1s (86.05%)
 - Upload files n°1968/2286 in 1s (86.09%)
 [...]

 - Upload files n°2137/2286 in 1s (93.48%)
 - Upload files n°2138/2286 in 1s (93.53%)
 - Upload files n°2139/2286 in 1s (93.57%)
 - Upload files n°2140/2286 in 1s (93.61%)
 - Upload files n°2141/2286 in 1s (93.66%)
 - Upload files n°2142/2286 in 1s (93.7%)
 - Upload files n°2143/2286 in 1s (93.74%)
 - Upload files n°2144/2286 in 1s (93.79%)
 - Upload files n°2145/2286 in 1s (93.83%)
 - Upload files n°2146/2286 in 1s (93.88%)
 - Upload files n°2147/2286 in 1s (93.92%)
 - Upload files n°2148/2286 in 1s (93.96%)
 - Upload files n°2149/2286 in 1s (94.01%)
 - Upload files n°2150/2286 in 1s (94.05%)
 - Upload files n°2151/2286 in 1s (94.09%)
 - Upload files n°2152/2286 in 1s (94.14%)
 - Upload files n°2153/2286 in 1s (94.18%)
 - Upload files n°2154/2286 in 1s (94.23%)
 - Upload files n°2155/2286 in 1s (94.27%)
 - Upload files n°2156/2286 in 1s (94.31%)
 - Upload files n°2157/2286 in 1s (94.36%)
 - Upload files n°2158/2286 in 1s (94.4%)
 - Upload files n°2159/2286 in 0s (94.44%)
 - Upload fil