DESCRIPTION

An ETL (Extract, Transform, Load) is a process that collects data from several sources, applies transformations to it, and loads it into a data warehouse, a database, a data lake, or a similar store. The point of such a system is to centralize and consolidate data in a single reliable repository that supports analysis and, in turn, decision-making.
In this challenge, your objective is to use data engineering techniques and tools to build an ETL that collects, processes, and stores articles from the platform of the American media outlet The New York Times.
Our objective is to find the best candidate for the internship offer "Automatic generation of economic and financial analyses", not to obtain the best ETL. To that end, we assess each candidate's ability to tackle a fairly complex data engineering problem.

Your work will be evaluated along several axes:
working methodology;
code quality;
working environment;
rigor and logic of the work.

INSTRUCTIONS
The task is to set up an ETL.
The data source is the API of The New York Times.
Data is retrieved daily (example schedule: 0 8 * * *).
The final storage is your file system. You must set up a logical directory tree based on the date of the retrieved articles.
The transformation step consists of:
Cleaning the articles;
Enriching the articles with a sentiment analysis model.
NB: Python is the required programming language for this project.
Examples of useful tools: Airflow, pylint, git, pandas, …
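
The date-based directory tree requested above can be sketched as a small path helper. This is only an illustration: the function name `partition_path`, the `data` base folder, and the zero-padded `YYYY/MM/DD` layout are assumptions, not part of the brief. The input format matches the `pub_date` values returned by the API later in this notebook.

```python
from datetime import datetime
from pathlib import PurePosixPath


def partition_path(pub_date: str, base_dir: str = "data") -> PurePosixPath:
    """Map a publication date such as '2022-02-01T23:50:50+0000' to base_dir/YYYY/MM/DD."""
    # Only the leading "YYYY-MM-DD" part is needed to pick the folder.
    day = datetime.strptime(pub_date[:10], "%Y-%m-%d").date()
    return PurePosixPath(base_dir) / f"{day.year:04d}" / f"{day.month:02d}" / f"{day.day:02d}"


print(partition_path("2022-02-01T23:50:50+0000"))  # data/2022/02/01
```

Zero-padding the month and day keeps folders in chronological order when they are listed alphabetically.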

DELIVERABLE
The deliverable to submit is the URL of the project's GitHub repository, which you should take care to document well.
DEADLINE

The project must be submitted no later than 28 February 2023 at 23:59 GMT.


# SETTING UP AN EXTRACT - TRANSFORM - LOAD (ETL)

In [1]:
%pip install textBlob
%pip install pynytimes
%pip install nytimesarticle

Note: you may need to restart the kernel to use updated packages.


# EXTRACT

In [1]:
# API of The New York Times

import requests
import pandas as pd
import nltk
from textblob import TextBlob
import os
from datetime import datetime, timedelta

In [2]:
# Query the New York Times Article Search API directly over HTTP

def get_request(api_key, query, begin_date, end_date):
    # begin_date and end_date are "YYYYMMDD" strings, e.g. "20230217" and "20230227"
    url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
    params = {
        "q": query,
        "begin_date": begin_date,
        "end_date": end_date,
        "api-key": api_key,
    }

    # Send the query parameters to the endpoint (the original call passed the key as the URL)
    article_obam = requests.get(url, params=params, timeout=30)
    if article_obam.status_code == 200:
        return article_obam.json()
    else:
        print(f"Error: {article_obam.status_code}")
        return None

In [3]:
# Query the New York Times articles through the pynytimes client
from pynytimes import NYTAPI

key = "eujZVG99yNWDCAneFIsuUlxMZIbSAvwF"
def get_extract(key):
    nyt = NYTAPI(key)
    begin_date_str = "20220101" # example string for a start date
    end_date_str = "20220201" # example string for an end date

    begin_date = datetime.strptime(begin_date_str, "%Y%m%d").date() # convert the string to a date object
    end_date = datetime.strptime(end_date_str, "%Y%m%d").date() # convert the string to a date object

    articles_obam = nyt.article_search(
        query="Obama",
        dates={"begin": begin_date, "end": end_date},
        results=43,
        options={"sort": "newest"}
    )
    return articles_obam
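
Because the brief asks for a daily run, the fixed example dates above would normally be replaced by a window computed from the run date. A minimal sketch, assuming each run should collect the previous day's articles and that the API treats both bounds as inclusive; `daily_window` is a hypothetical helper, not part of pynytimes.

```python
from datetime import date, timedelta


def daily_window(run_day: date) -> tuple[date, date]:
    """Return (begin, end) covering the single day before run_day."""
    previous_day = run_day - timedelta(days=1)
    return previous_day, previous_day


begin, end = daily_window(date(2023, 2, 28))
# Same dictionary shape as the `dates` argument passed to article_search above.
dates = {"begin": begin, "end": end}
print(begin.strftime("%Y%m%d"), end.strftime("%Y%m%d"))  # 20230227 20230227
```

In a scheduled job, `run_day` would typically be `date.today()` or the scheduler's logical date.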


In [4]:
articles_obam = get_extract(key)

In [5]:
# Display the articles

for article in articles_obam:
    print(article["headline"]["main"])
    print(article["snippet"])
    print(article["web_url"])
    print()

Spotify Backs Joe Rogan’s Disinformation Machine
The streaming service picks Joe Rogan over Neil Young and Joni Mitchell.
https://www.nytimes.com/2022/02/01/opinion/spotify-joe-rogan-disinformation.html

A Race to Rethink Care After a Dire Diagnosis
With the backing of venture capital and well-known tech investors, Synapticure seeks to fill in the gaps in care and research for those with amyotrophic lateral sclerosis.
https://www.nytimes.com/2022/02/01/business/als-synapticure-startup.html

Transcript: Ezra Klein Interviews Amanda Litman
A conversation with the co-founder of Run for Something
https://www.nytimes.com/2022/02/01/podcasts/transcript-ezra-klein-interviews-amanda-litman.html

States Are Complicating Corporate Pandemic Planning
Companies are wrestling with an array of state rules that makes designing pandemic policies tricky.
https://www.nytimes.com/2022/02/01/business/dealbook/florida-texas-vaccine-mandates.html

U.S. and Allies Close to Reviving Nuclear Deal With Iran, Off

In [6]:
# Save the articles to a JSON file

import json
with open('articles_obam.json', 'w') as f:
    json.dump(articles_obam, f)

In [7]:
print(dir(articles_obam[0]))

['__class__', '__class_getitem__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__ior__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__or__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__ror__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'clear', 'copy', 'fromkeys', 'get', 'items', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']


# TRANSFORM

In [8]:
# Convert the data to a pandas DataFrame
import pandas as pd

articles_obam = pd.DataFrame(articles_obam)
articles_obam.head()


Unnamed: 0,abstract,web_url,snippet,lead_paragraph,print_section,print_page,source,multimedia,headline,keywords,pub_date,document_type,news_desk,section_name,byline,type_of_material,_id,word_count,uri,subsection_name
0,The streaming service picks Joe Rogan over Nei...,https://www.nytimes.com/2022/02/01/opinion/spo...,The streaming service picks Joe Rogan over Nei...,The streaming service Spotify would like us to...,SR,6.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'Spotify Backs Joe Rogan’s Disinforma...,"[{'name': 'persons', 'value': 'Rogan, Joe', 'r...",2022-02-01T23:50:50+0000,article,Editorial,Opinion,"{'original': 'By Greg Bensinger', 'person': [{...",Op-Ed,nyt://article/a86e6e3c-7687-589a-b27a-6b570784...,957,nyt://article/a86e6e3c-7687-589a-b27a-6b570784...,
1,With the backing of venture capital and well-k...,https://www.nytimes.com/2022/02/01/business/al...,With the backing of venture capital and well-k...,"In August 2017, Brian Wallach’s notion of time...",B,1.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'A Race to Rethink Care After a Dire ...,"[{'name': 'subject', 'value': 'Amyotrophic Lat...",2022-02-01T16:33:49+0000,article,Business,Business Day,"{'original': 'By Maureen Farrell', 'person': [...",News,nyt://article/d2a78924-19dc-551d-b2dd-665e27aa...,1375,nyt://article/d2a78924-19dc-551d-b2dd-665e27aa...,
2,A conversation with the co-founder of Run for ...,https://www.nytimes.com/2022/02/01/podcasts/tr...,A conversation with the co-founder of Run for ...,"Every Tuesday and Friday, Ezra Klein invites y...",,,The New York Times,[],{'main': 'Transcript: Ezra Klein Interviews Am...,"[{'name': 'persons', 'value': 'Klein, Ezra', '...",2022-02-01T15:09:47+0000,article,OpEd,Podcasts,"{'original': None, 'person': [], 'organization...",Op-Ed,nyt://article/94579bd5-6b92-5342-9fae-d2295bf8...,12451,nyt://article/94579bd5-6b92-5342-9fae-d2295bf8...,
3,Companies are wrestling with an array of state...,https://www.nytimes.com/2022/02/01/business/de...,Companies are wrestling with an array of state...,Now that the Biden administration’s nationwide...,,,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'States Are Complicating Corporate Pa...,"[{'name': 'organizations', 'value': 'Carlyle G...",2022-02-01T12:33:37+0000,article,Business,Business Day,"{'original': 'By Andrew Ross Sorkin, Jason Kar...",News,nyt://article/b22a88ad-1ede-5a11-b043-fb42e13b...,1807,nyt://article/b22a88ad-1ede-5a11-b043-fb42e13b...,DealBook
4,"A return to a 2015 accord is on the table, but...",https://www.nytimes.com/2022/01/31/us/politics...,"A return to a 2015 accord is on the table, but...",WASHINGTON — The United States and its Europea...,A,8.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'U.S. and Allies Close to Reviving Nu...,"[{'name': 'subject', 'value': 'Nuclear Weapons...",2022-02-01T00:27:13+0000,article,Washington,U.S.,"{'original': 'By David E. Sanger, Lara Jakes a...",News,nyt://article/1f993990-0666-5f54-9de8-957abfcb...,1317,nyt://article/1f993990-0666-5f54-9de8-957abfcb...,Politics


## Description of each DataFrame column
#### abstract: a summary of the article
#### web_url: the URL of the article on the New York Times website
#### snippet: a short excerpt of the article
#### lead_paragraph: the first paragraph of the article
#### source: the source of the article
#### multimedia: the media attached to the article (photos, videos, etc.)
#### headline: the title of the article
#### keywords: the keywords associated with the article
#### pub_date: the publication date of the article
#### document_type: the type of document (usually "article")
#### news_desk: the desk that covered the article (for example "politics", "sports", etc.)
#### section_name: the section of the article on the New York Times website
#### subsection_name: the subsection of the article on the New York Times website
#### byline: the author of the article
#### type_of_material: the type of content (for example "news", "opinion", etc.)
#### _id: the unique identifier of the article
#### word_count: the number of words in the article
#### uri: the URI of the article
#### print_section: the section of the article in the print edition
#### print_page: the page of the article in the print edition
#### sentiment: the sentiment polarity of the article (computed with a sentiment analysis model, added later in this notebook)


In [9]:
# Count the number of NaN values in each column

articles_obam.isnull().sum()

abstract             0
web_url              0
snippet              0
lead_paragraph       0
print_section       15
print_page          15
source               0
multimedia           0
headline             0
keywords             0
pub_date             0
document_type        0
news_desk            0
section_name         0
byline               0
type_of_material     0
_id                  0
word_count           0
uri                  0
subsection_name     20
dtype: int64

In [10]:
# Replace NaN values with an empty string
articles_obam.fillna('', inplace=True)

In [11]:
articles_obam.head()

Unnamed: 0,abstract,web_url,snippet,lead_paragraph,print_section,print_page,source,multimedia,headline,keywords,pub_date,document_type,news_desk,section_name,byline,type_of_material,_id,word_count,uri,subsection_name
0,The streaming service picks Joe Rogan over Nei...,https://www.nytimes.com/2022/02/01/opinion/spo...,The streaming service picks Joe Rogan over Nei...,The streaming service Spotify would like us to...,SR,6.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'Spotify Backs Joe Rogan’s Disinforma...,"[{'name': 'persons', 'value': 'Rogan, Joe', 'r...",2022-02-01T23:50:50+0000,article,Editorial,Opinion,"{'original': 'By Greg Bensinger', 'person': [{...",Op-Ed,nyt://article/a86e6e3c-7687-589a-b27a-6b570784...,957,nyt://article/a86e6e3c-7687-589a-b27a-6b570784...,
1,With the backing of venture capital and well-k...,https://www.nytimes.com/2022/02/01/business/al...,With the backing of venture capital and well-k...,"In August 2017, Brian Wallach’s notion of time...",B,1.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'A Race to Rethink Care After a Dire ...,"[{'name': 'subject', 'value': 'Amyotrophic Lat...",2022-02-01T16:33:49+0000,article,Business,Business Day,"{'original': 'By Maureen Farrell', 'person': [...",News,nyt://article/d2a78924-19dc-551d-b2dd-665e27aa...,1375,nyt://article/d2a78924-19dc-551d-b2dd-665e27aa...,
2,A conversation with the co-founder of Run for ...,https://www.nytimes.com/2022/02/01/podcasts/tr...,A conversation with the co-founder of Run for ...,"Every Tuesday and Friday, Ezra Klein invites y...",,,The New York Times,[],{'main': 'Transcript: Ezra Klein Interviews Am...,"[{'name': 'persons', 'value': 'Klein, Ezra', '...",2022-02-01T15:09:47+0000,article,OpEd,Podcasts,"{'original': None, 'person': [], 'organization...",Op-Ed,nyt://article/94579bd5-6b92-5342-9fae-d2295bf8...,12451,nyt://article/94579bd5-6b92-5342-9fae-d2295bf8...,
3,Companies are wrestling with an array of state...,https://www.nytimes.com/2022/02/01/business/de...,Companies are wrestling with an array of state...,Now that the Biden administration’s nationwide...,,,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'States Are Complicating Corporate Pa...,"[{'name': 'organizations', 'value': 'Carlyle G...",2022-02-01T12:33:37+0000,article,Business,Business Day,"{'original': 'By Andrew Ross Sorkin, Jason Kar...",News,nyt://article/b22a88ad-1ede-5a11-b043-fb42e13b...,1807,nyt://article/b22a88ad-1ede-5a11-b043-fb42e13b...,DealBook
4,"A return to a 2015 accord is on the table, but...",https://www.nytimes.com/2022/01/31/us/politics...,"A return to a 2015 accord is on the table, but...",WASHINGTON — The United States and its Europea...,A,8.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'U.S. and Allies Close to Reviving Nu...,"[{'name': 'subject', 'value': 'Nuclear Weapons...",2022-02-01T00:27:13+0000,article,Washington,U.S.,"{'original': 'By David E. Sanger, Lara Jakes a...",News,nyt://article/1f993990-0666-5f54-9de8-957abfcb...,1317,nyt://article/1f993990-0666-5f54-9de8-957abfcb...,Politics


In [12]:
# Count the number of NaN values in each column again
articles_obam.isnull().sum()

abstract            0
web_url             0
snippet             0
lead_paragraph      0
print_section       0
print_page          0
source              0
multimedia          0
headline            0
keywords            0
pub_date            0
document_type       0
news_desk           0
section_name        0
byline              0
type_of_material    0
_id                 0
word_count          0
uri                 0
subsection_name     0
dtype: int64

In [13]:
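import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# List of English stopwords (stored under a new name so that the
# nltk.corpus.stopwords module is not overwritten)
english_stopwords = stopwords.words('english')
english_stopwords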

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\NHOURA\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each
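
The stopword list is loaded above but never applied. A minimal sketch of the "clean the articles" step, assuming cleaning means lowercasing, dropping punctuation and digits, and removing stopwords. `clean_text` and `SAMPLE_STOPWORDS` are hypothetical names, and the small set stands in for the full NLTK list so the example runs on its own.

```python
import re

# Tiny stand-in for the NLTK English stopword list loaded above (assumption:
# in the notebook you would pass the full list instead).
SAMPLE_STOPWORDS = {"the", "a", "an", "of", "and", "over", "with", "for", "to", "in"}


def clean_text(text: str, stopwords=SAMPLE_STOPWORDS) -> str:
    """Lowercase the text, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(token for token in tokens if token not in stopwords)


print(clean_text("The streaming service picks Joe Rogan over Neil Young and Joni Mitchell."))
# streaming service picks joe rogan neil young joni mitchell
```

One caveat: the NLTK list contains negations such as "not", so it may be safer to score sentiment on the original text and keep the cleaned text for other uses.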

In [14]:
# List the DataFrame columns to spot those not needed for sentiment analysis
articles_obam.columns

Index(['abstract', 'web_url', 'snippet', 'lead_paragraph', 'print_section',
       'print_page', 'source', 'multimedia', 'headline', 'keywords',
       'pub_date', 'document_type', 'news_desk', 'section_name', 'byline',
       'type_of_material', '_id', 'word_count', 'uri', 'subsection_name'],
      dtype='object')

In [15]:
from textblob import TextBlob

def analyze_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity
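
TextBlob's polarity is a float between -1 (negative) and 1 (positive). If a categorical label is easier to analyze downstream, the score can be bucketed. The sketch below is one possible convention: `label_polarity` and the 0.05 threshold are assumptions chosen for illustration, not TextBlob defaults.

```python
def label_polarity(polarity: float, threshold: float = 0.05) -> str:
    """Turn a polarity score in [-1, 1] into a coarse sentiment label."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


for score in (0.1, 0.0, 0.068182, -0.3):
    print(score, label_polarity(score))
```

It could be applied to a `sentiment` column with `.apply(label_polarity)` once that column exists.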

In [16]:
# Add a sentiment column to the DataFrame

articles_obam['sentiment'] = articles_obam['snippet'].apply(analyze_sentiment)
articles_obam.head()

Unnamed: 0,abstract,web_url,snippet,lead_paragraph,print_section,print_page,source,multimedia,headline,keywords,...,document_type,news_desk,section_name,byline,type_of_material,_id,word_count,uri,subsection_name,sentiment
0,The streaming service picks Joe Rogan over Nei...,https://www.nytimes.com/2022/02/01/opinion/spo...,The streaming service picks Joe Rogan over Nei...,The streaming service Spotify would like us to...,SR,6.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'Spotify Backs Joe Rogan’s Disinforma...,"[{'name': 'persons', 'value': 'Rogan, Joe', 'r...",...,article,Editorial,Opinion,"{'original': 'By Greg Bensinger', 'person': [{...",Op-Ed,nyt://article/a86e6e3c-7687-589a-b27a-6b570784...,957,nyt://article/a86e6e3c-7687-589a-b27a-6b570784...,,0.1
1,With the backing of venture capital and well-k...,https://www.nytimes.com/2022/02/01/business/al...,With the backing of venture capital and well-k...,"In August 2017, Brian Wallach’s notion of time...",B,1.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'A Race to Rethink Care After a Dire ...,"[{'name': 'subject', 'value': 'Amyotrophic Lat...",...,article,Business,Business Day,"{'original': 'By Maureen Farrell', 'person': [...",News,nyt://article/d2a78924-19dc-551d-b2dd-665e27aa...,1375,nyt://article/d2a78924-19dc-551d-b2dd-665e27aa...,,0.0
2,A conversation with the co-founder of Run for ...,https://www.nytimes.com/2022/02/01/podcasts/tr...,A conversation with the co-founder of Run for ...,"Every Tuesday and Friday, Ezra Klein invites y...",,,The New York Times,[],{'main': 'Transcript: Ezra Klein Interviews Am...,"[{'name': 'persons', 'value': 'Klein, Ezra', '...",...,article,OpEd,Podcasts,"{'original': None, 'person': [], 'organization...",Op-Ed,nyt://article/94579bd5-6b92-5342-9fae-d2295bf8...,12451,nyt://article/94579bd5-6b92-5342-9fae-d2295bf8...,,0.0
3,Companies are wrestling with an array of state...,https://www.nytimes.com/2022/02/01/business/de...,Companies are wrestling with an array of state...,Now that the Biden administration’s nationwide...,,,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'States Are Complicating Corporate Pa...,"[{'name': 'organizations', 'value': 'Carlyle G...",...,article,Business,Business Day,"{'original': 'By Andrew Ross Sorkin, Jason Kar...",News,nyt://article/b22a88ad-1ede-5a11-b043-fb42e13b...,1807,nyt://article/b22a88ad-1ede-5a11-b043-fb42e13b...,DealBook,0.0
4,"A return to a 2015 accord is on the table, but...",https://www.nytimes.com/2022/01/31/us/politics...,"A return to a 2015 accord is on the table, but...",WASHINGTON — The United States and its Europea...,A,8.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'U.S. and Allies Close to Reviving Nu...,"[{'name': 'subject', 'value': 'Nuclear Weapons...",...,article,Washington,U.S.,"{'original': 'By David E. Sanger, Lara Jakes a...",News,nyt://article/1f993990-0666-5f54-9de8-957abfcb...,1317,nyt://article/1f993990-0666-5f54-9de8-957abfcb...,Politics,0.068182


In [17]:
# Function performing the transformation step of the ETL pipeline

def transform_data(articles_obam):
    # Drop the columns that are not needed downstream
    articles_obam.drop(['_id', 'document_type', 'multimedia', 'news_desk', 'print_page', 'print_section', 'section_name', 'snippet', 'type_of_material', 'word_count', 'subsection_name'], axis=1, inplace=True)

    # Rename the columns. Note: 'headline.main' only exists when the data has been
    # flattened (e.g. with pd.json_normalize); here 'headline' is still a column of
    # dicts, so rename() silently ignores that entry.
    articles_obam.rename(columns={'abstract': 'resume', 'headline.main': 'titre', 'pub_date': 'date', 'web_url': 'url'}, inplace=True)
    return articles_obam
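
Since `headline` and `byline` are nested dictionaries in the API response, a title column has to be extracted explicitly. A minimal sketch working on one raw article dict: `flatten_article` and the `auteur` key are hypothetical, while `titre`, `date` and `url` follow the renaming used above.

```python
def flatten_article(article: dict) -> dict:
    """Pull the nested title and author out of one raw article record."""
    headline = article.get("headline") or {}
    byline = article.get("byline") or {}
    return {
        "titre": headline.get("main"),
        "auteur": byline.get("original"),  # may be None, as for the podcast transcript
        "date": article.get("pub_date"),
        "url": article.get("web_url"),
    }


raw = {
    "headline": {"main": "Spotify Backs Joe Rogan’s Disinformation Machine"},
    "byline": {"original": "By Greg Bensinger"},
    "pub_date": "2022-02-01T23:50:50+0000",
    "web_url": "https://www.nytimes.com/2022/02/01/opinion/spotify-joe-rogan-disinformation.html",
}
print(flatten_article(raw)["titre"])
```

Applied to the list returned by the extract step, this yields flat records that convert directly to a DataFrame with scalar columns.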

In [18]:
articles_obam = transform_data(articles_obam)
articles_obam.head()

Unnamed: 0,resume,url,lead_paragraph,source,headline,keywords,date,byline,uri,sentiment
0,The streaming service picks Joe Rogan over Nei...,https://www.nytimes.com/2022/02/01/opinion/spo...,The streaming service Spotify would like us to...,The New York Times,{'main': 'Spotify Backs Joe Rogan’s Disinforma...,"[{'name': 'persons', 'value': 'Rogan, Joe', 'r...",2022-02-01T23:50:50+0000,"{'original': 'By Greg Bensinger', 'person': [{...",nyt://article/a86e6e3c-7687-589a-b27a-6b570784...,0.1
1,With the backing of venture capital and well-k...,https://www.nytimes.com/2022/02/01/business/al...,"In August 2017, Brian Wallach’s notion of time...",The New York Times,{'main': 'A Race to Rethink Care After a Dire ...,"[{'name': 'subject', 'value': 'Amyotrophic Lat...",2022-02-01T16:33:49+0000,"{'original': 'By Maureen Farrell', 'person': [...",nyt://article/d2a78924-19dc-551d-b2dd-665e27aa...,0.0
2,A conversation with the co-founder of Run for ...,https://www.nytimes.com/2022/02/01/podcasts/tr...,"Every Tuesday and Friday, Ezra Klein invites y...",The New York Times,{'main': 'Transcript: Ezra Klein Interviews Am...,"[{'name': 'persons', 'value': 'Klein, Ezra', '...",2022-02-01T15:09:47+0000,"{'original': None, 'person': [], 'organization...",nyt://article/94579bd5-6b92-5342-9fae-d2295bf8...,0.0
3,Companies are wrestling with an array of state...,https://www.nytimes.com/2022/02/01/business/de...,Now that the Biden administration’s nationwide...,The New York Times,{'main': 'States Are Complicating Corporate Pa...,"[{'name': 'organizations', 'value': 'Carlyle G...",2022-02-01T12:33:37+0000,"{'original': 'By Andrew Ross Sorkin, Jason Kar...",nyt://article/b22a88ad-1ede-5a11-b043-fb42e13b...,0.0
4,"A return to a 2015 accord is on the table, but...",https://www.nytimes.com/2022/01/31/us/politics...,WASHINGTON — The United States and its Europea...,The New York Times,{'main': 'U.S. and Allies Close to Reviving Nu...,"[{'name': 'subject', 'value': 'Nuclear Weapons...",2022-02-01T00:27:13+0000,"{'original': 'By David E. Sanger, Lara Jakes a...",nyt://article/1f993990-0666-5f54-9de8-957abfcb...,0.068182


# LOAD

In [19]:
# Function performing the load step of the ETL pipeline

def load_data(articles_obam):
    # Write the DataFrame to articles_obam.csv
    articles_obam.to_csv('articles_obam.csv', index=False)
    # Return the DataFrame so the caller does not end up with None
    return articles_obam

In [None]:
articles_obam = load_data(articles_obam)
articles_obam
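
The load step above writes a single CSV, whereas the brief asks for storage organized by article date. One way to reconcile the two is to write one file per publication day. This is a sketch under assumptions: `load_by_day`, the `articles.json` file name and the sample records are illustrative, and it works on plain dicts that still carry a `pub_date` key.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def load_by_day(articles, base_dir):
    """Write one articles.json per publication day under base_dir/YYYY/MM/DD."""
    by_day = defaultdict(list)
    for article in articles:
        # pub_date looks like "2022-02-01T23:50:50+0000"; keep "YYYY-MM-DD".
        by_day[article["pub_date"][:10]].append(article)

    written = []
    for day, items in sorted(by_day.items()):
        year, month, day_of_month = day.split("-")
        folder = Path(base_dir) / year / month / day_of_month
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / "articles.json"  # hypothetical file name
        target.write_text(json.dumps(items, ensure_ascii=False, indent=2), encoding="utf-8")
        written.append(target)
    return written


# Made-up sample records, for illustration only.
sample = [
    {"pub_date": "2022-02-01T23:50:50+0000", "sentiment": 0.1},
    {"pub_date": "2022-02-01T16:33:49+0000", "sentiment": 0.0},
    {"pub_date": "2022-01-31T09:00:00+0000", "sentiment": 0.2},
]

with tempfile.TemporaryDirectory() as tmp:
    for path in load_by_day(sample, tmp):
        print(path.relative_to(tmp).as_posix())
```

Writing directly into dated folders avoids the separate "move files afterwards" pass and makes a rerun for a given day overwrite only that day's file.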

# CREATING THE FILE TREE

In [24]:
# Create the file tree

import os
import shutil
import datetime

# base path for storing the article files
base_dir = "Data_proj"
os.makedirs(base_dir, exist_ok=True)

# return the destination folder for a file based on its date
def get_path_for_date(date):
    year = date.year
    month = date.month
    day = date.day
    return os.path.join(base_dir, str(year), str(month), str(day))

# process each article file
for filename in os.listdir(base_dir):
    src_path = os.path.join(base_dir, filename)
    # skip the year folders created by previous runs
    if not os.path.isfile(src_path):
        continue

    # assume the file name starts with the date in "YYYY-MM-DD" format
    date_str = filename[:10]
    try:
        date = datetime.datetime.strptime(date_str, "%Y-%m-%d").date()
    except ValueError:
        continue  # the name does not start with a date: leave the file in place

    # work out the destination folder for the file
    dest_path = get_path_for_date(date)

    # create the required folders if they do not exist yet
    os.makedirs(dest_path, exist_ok=True)

    # move the file into the matching folder
    shutil.move(src_path, dest_path)


# AIRFLOW

In [26]:
%pip install apache-airflow

Collecting apache-airflow
  Downloading apache_airflow-2.5.1-py3-none-any.whl (11.8 MB)
     -------------------------------------- 11.8/11.8 MB 960.5 kB/s eta 0:00:00
Collecting python-nvd3>=0.15.0
  Downloading python-nvd3-0.15.0.tar.gz (31 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting argcomplete>=1.10
  Downloading argcomplete-2.0.0-py2.py3-none-any.whl (37 kB)
Collecting cattrs>=22.1.0
  Downloading cattrs-22.2.0-py3-none-any.whl (35 kB)
Collecting apache-airflow-providers-imap
  Downloading apache_airflow_providers_imap-3.1.1-py3-none-any.whl (17 kB)
Collecting mdit-py-plugins>=0.3.0
  Downloading mdit_py_plugins-0.3.4-py3-none-any.whl (52 kB)
     -------------------------------------- 52.1/52.1 kB 243.2 kB/s eta 0:00:00
Collecting gunicorn>=20.1.0
  Downloading gunicorn-20.1.0-py3-none-any.whl (79 kB)
     ---------------------------------------- 79.5/79.5 kB 4.6 MB/s eta 0:00:00
Collecting pendulum>=2.0
  D

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
anaconda-project 0.11.1 requires ruamel-yaml, which is not installed.


In [28]:
%pip install ruamel-yaml

Collecting ruamel-yaml
  Downloading ruamel.yaml-0.17.21-py3-none-any.whl (109 kB)
     ------------------------------------ 109.5/109.5 kB 530.7 kB/s eta 0:00:00
Collecting ruamel.yaml.clib>=0.2.6
  Downloading ruamel.yaml.clib-0.2.7-cp39-cp39-win_amd64.whl (118 kB)
     -------------------------------------- 118.4/118.4 kB 2.3 MB/s eta 0:00:00
Installing collected packages: ruamel.yaml.clib, ruamel-yaml
Successfully installed ruamel-yaml-0.17.21 ruamel.yaml.clib-0.2.7
Note: you may need to restart the kernel to use updated packages.


In [29]:
%pip install apache-airflow

Note: you may need to restart the kernel to use updated packages.


In [27]:
# Import the libraries
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
from datetime import timedelta

# Default arguments of the DAG
args = {
    'owner': 'airflow',
    'start_date': days_ago(2),
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Define the DAG
dag = DAG(
    dag_id='articles_obam',
    default_args=args,
    description='Extract, transform and load the articles',
    schedule_interval=timedelta(days=1),
)

# Define the tasks
t1 = BashOperator(
    task_id='extract_data',
    bash_command='python3 extract_data.py',
    dag=dag,
)

t2 = BashOperator(
    task_id='transform_data',
    bash_command='python3 transform_data.py',
    dag=dag,
)

t3 = BashOperator(
    task_id='load_data',
    bash_command='python3 load_data.py',
    dag=dag,
)

# Define the dependencies: extract, then transform, then load
t1 >> t2 >> t3

# Expose the DAG's command-line interface when this file is run as a script
if __name__ == "__main__":
    dag.cli()

# The web server and the scheduler are shell commands, not Python statements
# (leaving them in the cell raised a SyntaxError). Run them in a terminal:
#   airflow webserver -p 8080
#   airflow scheduler

In [None]:

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2022, 2, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'my_dag',
    default_args=default_args,
    schedule_interval='7 16 * * *',
)

task = BashOperator(
    task_id='my_task',
    bash_command='echo "Hello world"',
    dag=dag,
)


Jenkins is well suited to automating linear tasks and to continuous integration, whereas Airflow is better suited to managing complex workflows, with dependencies and tasks that must run in a specific order.
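
To make "tasks that must run in a specific order" concrete, the dependency idea can be illustrated with the standard library alone. This is not how Airflow is implemented; it is a small sketch of deriving an execution order from declared dependencies, reusing the three task names from the DAG above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "extract_data": set(),
    "transform_data": {"extract_data"},
    "load_data": {"transform_data"},
}

execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)  # ['extract_data', 'transform_data', 'load_data']
```

With branching workflows, several orders can be valid; a scheduler such as Airflow additionally handles retries, scheduling, and parallel execution of independent tasks.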