# Course 6 - Data collection

**Course objectives:**
- explore the basics of web scraping with Python
- get to know the main resources for finding datasets
- choose the dataset your project will be based on

## Resources for finding datasets

I encourage you to browse the following (non-exhaustive) list of sites to get an idea of the datasets available online.

- https://www.data.gouv.fr/fr/
- https://www.kaggle.com/datasets
- https://www.data.gov
- https://github.com/awesomedata/awesome-public-datasets
- https://data.gov.uk
- https://worlddata.ai/partners/kdnuggets
- https://github.com/niderhoff/nlp-datasets
- https://opendata.paris.fr/pages/home/

**Exercise:** Given the variety of datasets available, ask yourself the following questions to help you choose:
- Which domains interest me most (biology, sports, economics, etc.)?
- What types of variables would I like to work with (text, categories, numbers, dates, ...)?
- Is there a particular question, or set of questions, that I would like to explore?

## Introduction to web scraping

As an example, let's look at how I collected the data used to build pays.sql.

The data comes from the following site: http://doheth.co.uk/info/countries-of-the-world.php

We will use two libraries: `requests`, to send an HTTP request to the site, and `BeautifulSoup`, to parse the HTML it returns.

In [1]:
import requests
from bs4 import BeautifulSoup

# Download the page
URL = "http://doheth.co.uk/info/countries-of-the-world.php"
page = requests.get(URL)

# Parse the raw HTML into a tree that can be searched by tag
soup = BeautifulSoup(page.content, "html.parser")

In [2]:
soup

<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<title>Countries of the World ~ capital cities, populations, sizes and more</title>
<link href="/info/geography.css" rel="stylesheet"/>
<meta content="List of all countries in the world with capital cities and more details." name="description"/>
<script src="/js/jquery.js"></script>
<script src="/js/jquery.tablesorter.js"></script>
<script src="/js/jquery.tablesorter.parsers.js"></script>
<script>
$(document).ready(function(){
	// sort table; disable sorting on currency
	$("#countries").tablesorter({
		sortList: [[0,0]],
		headers: { 3:{sorter:false}, 4:{sorter:'num'}, 5:{sorter:'num'}, 6:{sorter:'num'}, 7:{sorter:'num'}, 8:{sorter:'num'}, 9:{sorter:'num'}, 10:{sorter:'num'} }
	})
});
</script>
</head>
<body>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.pare

Inspecting the page source shows that the information for each country is located in the `tr` tags (table rows).

In [3]:
all_data = soup.find_all("tr")
all_data

[<tr>
 <th>Name<br/>
 <small>(English)</small></th>
 <th>Location<br/>
 <small>(Continent)</small></th>
 <th>Capital city<br/>
 <small>(official)</small></th>
 <th>Currency<br/>
 <small>(poss. multiple)</small></th>
 <th>Population<br/>
 <small>(inhabitants)</small></th>
 <th>Area<br/>
 <small>(km²)</small></th>
 <th>Pop. density<br/>
 <small>(inhabitants/km²)</small></th>
 <th>GDP (nominal)<br/>
 <small>(millions of USD)</small></th>
 <th>Life exp.<br/>
 <small>(years)</small></th>
 <th>Birth rate<br/>
 <small>(births/1,000)</small></th>
 <th>Death rate<br/>
 <small>(deaths/1,000)</small></th>
 </tr>,
 <tr>
 <td class="name">Abkhazia</td>
 <td>Asia<br/>
 <small>(Central West)</small></td>
 <td>Sukhumi</td>
 <td>Georgian lari;<br/>Russian ruble</td>
 <td class="numeric">216,000</td>
 <td class="numeric">8,600</td>
 <td class="numeric">25</td>
 <td class="numeric">-</td>
 <td class="numeric">-</td>
 <td class="numeric">-</td>
 <td class="numeric">-</td>
 </tr>,
 <tr>
 <td class="name">A

Let's now look at the structure of one of these tags in more detail.

In [5]:
all_data[1]

<tr>
<td class="name">Abkhazia</td>
<td>Asia<br/>
<small>(Central West)</small></td>
<td>Sukhumi</td>
<td>Georgian lari;<br/>Russian ruble</td>
<td class="numeric">216,000</td>
<td class="numeric">8,600</td>
<td class="numeric">25</td>
<td class="numeric">-</td>
<td class="numeric">-</td>
<td class="numeric">-</td>
<td class="numeric">-</td>
</tr>

The information we want to extract is located in the `td` tags.

In [15]:
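import re

data = []
# Skip the first row (all_data[0]), which holds the column headers
for country in all_data[1:]:
    temp_data = []
    content = country.find_all("td")
    for line in content:
        # Splitting '<td ...>text<...' on '<' and '>' puts the first text fragment at index 2;
        # any text that follows a nested tag such as <br/> is dropped
        temp_data.append(re.split('>|<', str(line))[2])
    data.append(temp_data)

print(data)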

[['Abkhazia', 'Asia', 'Sukhumi', 'Georgian lari;', '216,000', '8,600', '25', '-', '-', '-', '-'], ['Afghanistan', 'Asia', 'Kabul', 'Afghan afghani', '29,863,010', '652,090', '46', '7,168', '42.90', '46.60', '20.75'], ['Albania', 'Europe', 'Tirana', 'Albanian lek', '3,129,678', '28,748', '109', '8,379', '77.24', '15.11', '5.12'], ['Algeria', 'Africa', 'Algiers', 'Algerian dinar', '32,853,800', '2,381,741', '13', '102,257', '73.00', '17.14', '4.60'], ['Andorra', 'Europe', 'Andorra la Vella', 'Euro', '67,151', '468', '143', '960', '83.51', '8.71', '6.07'], ['Angola', 'Africa', 'Luanda', 'Angolan kwanza', '15,941,390', '1,246,700', '12', '28,038', '38.43', '45.11', '24.50'], ['Antigua and Barbuda', 'North America', "St. John's", 'East Caribbean dollar', '81,479', '442', '184', '905', '71.90', '16.93', '5.44'], ['Argentina', 'South America', 'Buenos Aires', 'Argentine peso', '38,747,150', '2,780,400', '13', '183,309', '75.91', '16.73', '7.56'], ['Armenia', 'Asia', 'Yerevan', 'Armenian dram'
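
To see why index 2 of the split holds the cell text, the following minimal sketch applies the same `re.split` call to three cell strings copied from the Abkhazia row shown earlier. It also illustrates a limitation: text that follows a nested tag such as `<br/>` is not captured, which is why the currency appears as 'Georgian lari;' in the output. The helper name `first_text` is introduced here for illustration only.

```python
import re

def first_text(cell_html):
    """Return the first text fragment of a cell, using the same split as the loop."""
    # Splitting on '<' and '>' yields ['', 'td ...', 'first text', ...]
    return re.split('>|<', cell_html)[2]

cells = [
    '<td class="name">Abkhazia</td>',
    '<td>Asia<br/>\n<small>(Central West)</small></td>',
    '<td>Georgian lari;<br/>Russian ruble</td>',
]

for cell in cells:
    print(first_text(cell))
```

If the full cell text is needed, BeautifulSoup's `get_text()` method on each `td` is probably a more robust choice than splitting the raw HTML string.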

We now want to add the column labels to our list of lists.

In [16]:
header = []
content = all_data[0].find_all("th")
for line in content:
    header.append(re.split('>|<',str(line))[2]) 

header

['Name',
 'Location',
 'Capital city',
 'Currency',
 'Population',
 'Area',
 'Pop. density',
 'GDP (nominal)',
 'Life exp.',
 'Birth rate',
 'Death rate']

In [17]:
data.insert(0, header)
data

[['Name',
  'Location',
  'Capital city',
  'Currency',
  'Population',
  'Area',
  'Pop. density',
  'GDP (nominal)',
  'Life exp.',
  'Birth rate',
  'Death rate'],
 ['Abkhazia',
  'Asia',
  'Sukhumi',
  'Georgian lari;',
  '216,000',
  '8,600',
  '25',
  '-',
  '-',
  '-',
  '-'],
 ['Afghanistan',
  'Asia',
  'Kabul',
  'Afghan afghani',
  '29,863,010',
  '652,090',
  '46',
  '7,168',
  '42.90',
  '46.60',
  '20.75'],
 ['Albania',
  'Europe',
  'Tirana',
  'Albanian lek',
  '3,129,678',
  '28,748',
  '109',
  '8,379',
  '77.24',
  '15.11',
  '5.12'],
 ['Algeria',
  'Africa',
  'Algiers',
  'Algerian dinar',
  '32,853,800',
  '2,381,741',
  '13',
  '102,257',
  '73.00',
  '17.14',
  '4.60'],
 ['Andorra',
  'Europe',
  'Andorra la Vella',
  'Euro',
  '67,151',
  '468',
  '143',
  '960',
  '83.51',
  '8.71',
  '6.07'],
 ['Angola',
  'Africa',
  'Luanda',
  'Angolan kwanza',
  '15,941,390',
  '1,246,700',
  '12',
  '28,038',
  '38.43',
  '45.11',
  '24.50'],
 ['Antigua and Barbuda',
  '
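
As an alternative to inserting the header as the first row, each row can be paired with the column labels. This is a minimal sketch using a shortened header and two abridged rows taken from the output; the names `short_header`, `rows`, and `records` are illustrative and not part of the original notebook.

```python
short_header = ['Name', 'Location', 'Capital city']
rows = [
    ['Abkhazia', 'Asia', 'Sukhumi'],
    ['Afghanistan', 'Asia', 'Kabul'],
]

# zip pairs each label with the value in the same position
records = [dict(zip(short_header, row)) for row in rows]

for record in records:
    print(record['Name'], '->', record['Capital city'])
```

A list of dictionaries like `records` can also be passed directly to `pd.DataFrame`.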

In [25]:
import pandas as pd

df = pd.DataFrame.from_records(data[1:], columns=data[0])
df

| | Name | Location | Capital city | Currency | Population | Area | Pop. density | GDP (nominal) | Life exp. | Birth rate | Death rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abkhazia | Asia | Sukhumi | Georgian lari; | 216000 | 8600 | 25 | - | - | - | - |
| 1 | Afghanistan | Asia | Kabul | Afghan afghani | 29863010 | 652090 | 46 | 7168 | 42.90 | 46.60 | 20.75 |
| 2 | Albania | Europe | Tirana | Albanian lek | 3129678 | 28748 | 109 | 8379 | 77.24 | 15.11 | 5.12 |
| 3 | Algeria | Africa | Algiers | Algerian dinar | 32853800 | 2381741 | 13 | 102257 | 73.00 | 17.14 | 4.60 |
| 4 | Andorra | Europe | Andorra la Vella | Euro | 67151 | 468 | 143 | 960 | 83.51 | 8.71 | 6.07 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 194 | Venezuela | South America | Caracas | Venezuelan bolívar | 26749110 | 912050 | 29 | 138857 | 74.31 | 18.71 | 4.90 |
| 195 | Vietnam | Asia | Hanoi | Vietnamese đồng | 84238230 | 331689 | 254 | 52408 | 70.61 | 16.86 | 6.20 |
| 196 | Yemen | Asia | Sanaá | Yemeni rial | 20974660 | 527968 | 40 | 14452 | 61.75 | 42.89 | 8.53 |
| 197 | Zambia | Africa | Lusaka | Zambian kwacha | 11668460 | 752618 | 15 | 7257 | 39.70 | 41.00 | 20.23 |
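
The values scraped into `data` are strings, and missing values are marked with '-', so they need converting before any arithmetic. Here is a minimal sketch in plain Python, assuming that commas are thousands separators and that '-' means "no value", as the printed `data` list suggests; `to_number` is a hypothetical helper, not part of the original notebook.

```python
def to_number(text):
    """Convert a scraped string such as '216,000' or '42.90' to a number.

    Returns None for the '-' placeholder or an empty string.
    """
    cleaned = text.replace(",", "").strip()
    if cleaned in ("", "-"):
        return None
    # Keep integers as int and decimals as float
    return float(cleaned) if "." in cleaned else int(cleaned)

# Row copied from the scraped output
row = ['Abkhazia', 'Asia', 'Sukhumi', 'Georgian lari;',
       '216,000', '8,600', '25', '-', '-', '-', '-']

# The first four columns are text; the rest are numeric
numeric_values = [to_number(value) for value in row[4:]]
print(numeric_values)
```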


In [26]:
#df.to_csv('pays.csv')
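
Since the goal was to build pays.sql, here is one possible way to load scraped rows into a SQL table, sketched with Python's built-in `sqlite3` module and an in-memory database. The table name `pays`, the column names, and the reduced set of three columns are assumptions made for illustration; the actual schema of pays.sql is not shown in this course.

```python
import sqlite3

# Two abridged rows (name, continent, population) taken from the scraped output
rows = [
    ('Albania', 'Europe', 3129678),
    ('Algeria', 'Africa', 32853800),
]

conn = sqlite3.connect(':memory:')  # in-memory database, nothing is written to disk
conn.execute('CREATE TABLE pays (name TEXT, continent TEXT, population INTEGER)')
# Parameterized inserts avoid building SQL strings by hand
conn.executemany('INSERT INTO pays VALUES (?, ?, ?)', rows)
conn.commit()

european = conn.execute(
    'SELECT name FROM pays WHERE continent = ?', ('Europe',)
).fetchall()
print(european)
```

`conn.iterdump()` yields the SQL statements needed to recreate the database, which is one way to produce a `.sql` file from such a table.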

**Exercise:** Build a DataFrame from the table on the following page: http://doheth.co.uk/info/us-states.php

## Scraping tweets

There are several ways to retrieve tweets: the Twitter API, the `snscrape` command-line tool, the Python library Tweepy, etc.

We will only take a quick look at how to use `snscrape`: https://medium.com/dataseries/how-to-scrape-millions-of-tweets-using-snscrape-195ee3594721

Let's try to retrieve French-language tweets containing the word "marvel", limited to a short date range (2021-07-27 to 2021-07-28).

In [27]:
%%bash
snscrape --jsonl --progress --max-results 50000 --since 2021-07-27 twitter-search 'lang:fr marvel until:2021-07-28' > tweets.json

Scraping, 100 results so far
Scraping, 200 results so far
Scraping, 300 results so far
Scraping, 400 results so far
Scraping, 500 results so far
Scraping, 600 results so far
Scraping, 700 results so far


In [29]:
twitter_data = pd.read_json('tweets.json', lines=True)
twitter_data

Unnamed: 0,url,date,content,renderedContent,id,user,outlinks,tcooutlinks,replyCount,retweetCount,...,lang,source,sourceUrl,sourceLabel,media,retweetedTweet,quotedTweet,mentionedUsers,coordinates,place
0,https://twitter.com/ununoctiium/status/1420171...,2021-07-27 23:58:18+00:00,@LuluciferGrt Et cumberbatch est définitivemen...,@LuluciferGrt Et cumberbatch est définitivemen...,1420171722171551744,"{'username': 'ununoctiium', 'displayname': 'ju...",[],[],1,0,...,fr,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,"[{'username': 'LuluciferGrt', 'displayname': '...",,
1,https://twitter.com/inquieto_Romeo/status/1420...,2021-07-27 23:55:19+00:00,- mas é q a marvel... https://t.co/4x2UEA8f11,- mas é q a marvel... https://t.co/4x2UEA8f11,1420170971219124224,"{'username': 'inquieto_Romeo', 'displayname': ...",[],[],0,0,...,fr,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,[{'previewUrl': 'https://pbs.twimg.com/media/E...,,,,,
2,https://twitter.com/the_abyssion/status/142016...,2021-07-27 23:36:30+00:00,@NetflixFR Ça y est ils nous font une marvel,@NetflixFR Ça y est ils nous font une marvel,1420166236168761347,"{'username': 'the_abyssion', 'displayname': 't...",[],[],1,0,...,fr,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,,,,"[{'username': 'NetflixFR', 'displayname': 'Net...",,
3,https://twitter.com/benialloush/status/1420165...,2021-07-27 23:35:20+00:00,Demain dans le train jsp si je regarde un Marv...,Demain dans le train jsp si je regarde un Marv...,1420165942550609923,"{'username': 'benialloush', 'displayname': 'ch...",[],[],0,0,...,fr,"<a href=""http://twitter.com/download/android"" ...",http://twitter.com/download/android,Twitter for Android,,,,,,
4,https://twitter.com/zougagmehd/status/14201641...,2021-07-27 23:28:05+00:00,@Marvel_Fit Ca se voit de ouf que c’est fake,@Marvel_Fit Ca se voit de ouf que c’est fake,1420164115406929921,"{'username': 'zougagmehd', 'displayname': 'Zou...",[],[],0,0,...,fr,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,"[{'username': 'Marvel_Fit', 'displayname': 'Ma...",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
768,https://twitter.com/7faat/status/1419813908995...,2021-07-27 00:16:29+00:00,J’ai envie de recommencer tout les films de Ma...,J’ai envie de recommencer tout les films de Ma...,1419813908995575809,"{'username': '7faat', 'displayname': 'Fat❤️‍🔥'...",[],[],1,1,...,fr,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,,,
769,https://twitter.com/orlandead/status/141981238...,2021-07-27 00:10:25+00:00,Puis jsp pourquoi les fans de Marvel ont peur ...,Puis jsp pourquoi les fans de Marvel ont peur ...,1419812381039996929,"{'username': 'orlandead', 'displayname': 'TRAN...",[],[],0,0,...,fr,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,,,
770,https://twitter.com/alwaysxhalo/status/1419812...,2021-07-27 00:09:08+00:00,favourite marvel character?,favourite marvel character?,1419812059743756296,"{'username': 'alwaysxhalo', 'displayname': 'Jo...",[],[],1,0,...,fr,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,,,
771,https://twitter.com/sahvasylui/status/14198114...,2021-07-27 00:06:33+00:00,@ksnawej c’est le pire marvel pour l’instant,@ksnawej c’est le pire marvel pour l’instant,1419811410381578250,"{'username': 'sahvasylui', 'displayname': 'DLK...",[],[],1,0,...,fr,"<a href=""http://twitter.com/download/iphone"" r...",http://twitter.com/download/iphone,Twitter for iPhone,,,,"[{'username': 'ksnawej', 'displayname': 'nana'...",,
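
The `--jsonl` option writes one JSON object per line, which is why `pd.read_json` is called with `lines=True`. The sketch below parses that format with the standard `json` module. The two records are made-up, heavily abridged stand-ins that only reuse field names visible in the table (`content`, `lang`, `user`, `username`); they are not real tweets.

```python
import json

# Hypothetical JSON Lines content: one JSON object per line
jsonl_text = (
    '{"content": "first example tweet", "lang": "fr", "user": {"username": "alice"}}\n'
    '{"content": "second example tweet", "lang": "fr", "user": {"username": "bob"}}\n'
)

# Parse each non-empty line independently
tweets = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

# The user field holds a nested object, so the username is one level down
usernames = [tweet['user']['username'] for tweet in tweets]
print(usernames)
```

The same nesting applies to the `user` column of `twitter_data`, where each cell holds a dictionary rather than a plain string.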


## Data Storytelling

![title](DataStorytelling.png)

I encourage you to look at the following site, which perfectly demonstrates the idea of telling a story with data:

https://pudding.cool

## Your turn!

Now it's up to you to find a dataset and a question you want to answer for your project!