# Data Science

## 1. Introduction

### Virtual environments  

A common problem is that one of your scripts requires version X of a library, whereas another of your scripts was written for version Y of that same library.

To avoid such **version conflicts**, it is generally recommended to set up a virtual environment. Python ships with the `venv` module for doing just that: see [here](https://docs.python.org/3/library/venv.html).

1. Set up a virtual environment called `my-venv`: ```python3 -m venv my-venv```
2. Activate it. On Unix-like systems: ```source my-venv/bin/activate```; on Windows: ```my-venv\Scripts\activate.bat```
3. Install whatever you need and work on your code.
4. Once you're done working, leave the environment using the `deactivate` command.

Following these steps will minimize the number of version conflicts you encounter, by keeping your installations separated on a per-project basis.
Also, you should try to come up with a better name than the example `my-venv` given here.
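
To check from inside Python whether the activation step worked, one option is to compare the interpreter prefixes. This is a minimal sketch, assuming an environment created with `venv`; `in_virtual_env` is a hypothetical helper name, not part of the standard library.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the running interpreter belongs to a venv.

    Inside a venv, sys.prefix points at the environment directory,
    while sys.base_prefix still points at the base installation.
    """
    return sys.prefix != sys.base_prefix


if in_virtual_env():
    print(f"Virtual environment active: {sys.prefix}")
else:
    print("No virtual environment active.")
```

Running this script before and after `source my-venv/bin/activate` should print the two different messages.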

You can list all the third-party libraries you have installed, along with their versions, using the command `pip freeze`.
You can also create a **requirements file**, where each line corresponds to a library (optionally with its version number). Assuming you have such a file, you can install all the libraries listed in it by running:
```
pip install -r requirements.txt
```
Note that the output of `pip freeze` corresponds exactly to the file format expected by `pip install -r`. It's therefore fairly customary on Unix-like systems to run something like:
```
pip freeze > requirements.txt
```
so that you can share your requirements file along with your code for anyone trying to reproduce your results.  

If you simply want to know what libraries are in your virtual environment, you can run:  
```
pip list
```
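
The same information is also available from within Python. The sketch below approximates the `name==version` lines of `pip freeze` using the standard library's `importlib.metadata` (Python 3.8+). `freeze_lines` is a hypothetical helper name, and the result may differ from pip's own output, for example for editable installs or for packages that pip leaves out by default.

```python
from importlib import metadata


def freeze_lines():
    """Return sorted 'name==version' strings for the installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing or broken metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


for line in freeze_lines():
    print(line)
```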

### Git  
* __[basic commands](https://git-scm.com/docs/gittutorial)__  
* __[exercises](https://gitexercises.fracz.com/)__  


## 2. Web basics and scraping

The `curl` command downloads a target page and, with the `-o` option, saves it to a file.
```
$ curl address -o output.html
```  
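For instance (the URL is quoted because it contains parentheses, which most shells would otherwise try to interpret):  
```
$ curl "https://fr.wikipedia.org/wiki/Montserrat_(Robl%C3%A8s)" -o montserrat.html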
```
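
The `%C3%A8` in the URL above is the percent-encoded form of `è`. As a rough Python counterpart to the `curl` call, the sketch below builds such a URL and saves a page to disk using only the standard library. `wiki_url` and `download` are hypothetical helper names, and some sites may reject the default `urllib` user agent.

```python
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlopen


def wiki_url(title: str, lang: str = "fr") -> str:
    """Build a Wikipedia article URL, percent-encoding non-ASCII characters."""
    # Keep parentheses literal, as in the curl example; "è" becomes "%C3%A8".
    return f"https://{lang}.wikipedia.org/wiki/{quote(title, safe='()')}"


def download(url: str, output: str) -> int:
    """Save the body found at `url` to `output`; return the number of bytes written."""
    with urlopen(url, timeout=10) as response:
        return Path(output).write_bytes(response.read())


print(wiki_url("Montserrat_(Roblès)"))
# → https://fr.wikipedia.org/wiki/Montserrat_(Robl%C3%A8s)

# To fetch the page (requires network access):
# download(wiki_url("Montserrat_(Roblès)"), "montserrat.html")
```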

### Requests

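```
import requests

# Some servers reject requests that lack a browser-like User-Agent header.
ua = {'User-agent': 'Mozilla/5.0'}
page = requests.get("http://synalp.loria.fr/index.html", headers=ua)

# page.content holds the raw bytes; decode them to get the HTML as text.
print(page.content.decode("utf-8"))
```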

Output (truncated):
```
<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <meta name="description" content="Homepage of the SyNaLP research team">
        <meta name="author" content="The SyNaLP team">
        <link rel="canonical" href="https://synalp.loria.fr/">
        <link rel="shortcut icon" href="img/favicon.ico">
        <title>SyNaLP</title>
        <link href="css/bootstrap.min.css" rel="stylesheet">
        <link href="css/font-awesome.min.css" rel="stylesheet">
        <link href="css/base.css" rel="stylesheet">
        <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/10.5.0/styles/mono-blue.min.css">
        <link href="css/css_tab.css" rel="stylesheet">
        <link href="css/extra.css" rel="stylesheet">
        <link href="css/neoteroi-mkdocs.css" rel="stylesheet">

        <script src="js/j
```

### BeautifulSoup

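```
from bs4 import BeautifulSoup
import re  # used further down for regular-expression filters

# Parse the downloaded page with Python's built-in HTML parser.
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
```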

Output (truncated):
```
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="Homepage of the SyNaLP research team" name="description"/>
  <meta content="The SyNaLP team" name="author"/>
  <link href="https://synalp.loria.fr/" rel="canonical"/>
  <link href="img/favicon.ico" rel="shortcut icon"/>
  <title>
   SyNaLP
  </title>
  <link href="css/bootstrap.min.css" rel="stylesheet"/>
  <link href="css/font-awesome.min.css" rel="stylesheet"/>
  <link href="css/base.css" rel="stylesheet"/>
  <link href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/10.5.0/styles/mono-blue.min.css" rel="stylesheet"/>
  <link href="css/css_tab.css" rel="stylesheet"/>
  <link href="css/extra.css" rel="stylesheet"/>
  <link href="css/neoteroi-mkdocs.css" rel="stylesheet"/>
  <script defer="" src="js/jquery-1.10.2.min.js">
  </script>
  <script defer="" src="js/bootstrap
```

```
# find_all: returns every matching element
soup.find_all("link")                          # all <link> tags
soup.find_all(href=True)                       # any tag that has an href attribute
soup.find_all("a", href=True, limit=3)         # at most 3 <a> tags with an href
soup.find_all(lambda x: x.has_attr("class"))   # filter with a function
soup.find_all(href=re.compile("^https"))       # filter with a regular expression

# find: returns only the first matching element
for child in soup.find(id="navbar-collapse").children:  # .children iterates over direct children
    pass
    #print(child)

# select: CSS selectors, for more flexible queries
# (in a notebook, only this last expression is displayed)
soup.select("div a")
```

Output (truncated):
```
[<a class="navbar-brand" href=".">SyNaLP</a>,
 <a class="nav-link" href="about/">About</a>,
 <a class="nav-link" href="softs/">Code</a>,
 <a class="nav-link" href="events/">Events</a>,
 <a class="nav-link" href="members/">Members</a>,
 <a class="nav-link" href="news/">News</a>,
 <a class="nav-link" href="projects/">Projects</a>,
 <a class="nav-link" href="pub_years/">Publications</a>,
 <a class="nav-link" href="practical/">Reaching us</a>,
 <a class="nav-link" data-target="#mkdocs_search_modal" data-toggle="modal" href="#">
 <i class="fa fa-search"></i> Search
                             </a>,
 <a class="nav-link" href="https://gitlab.inria.fr/synalp/synalp-website/-/edit/master/docs/index.md"><i class="fa fa-gitlab"></i> Edit on GitLab</a>,
 <a href="http://en.nancy-tourisme.fr">Nancy</a>,
 <a href="https://www.loria.fr/en">LORIA</a>,
 <a href="http://www.cnrs.fr/index.php">French National Scientific Research Center</a>,
 <a href="http://welcome.univ-lorraine.fr/en/research">Universi
```
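
Many of the `href` values in this listing are relative (`about/`, `members/`), so they usually need to be resolved against the page URL before they can be fetched. The sketch below does this with `find_all` and the standard library's `urljoin`. The HTML snippet is a hypothetical, shortened imitation of the page's navigation bar so that the example runs offline, and `extract_links` is a hypothetical helper name.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical snippet imitating a few of the links listed above.
HTML = """
<div id="navbar-collapse">
  <a class="nav-link" href="about/">About</a>
  <a class="nav-link" href="members/">Members</a>
  <a href="https://www.loria.fr/en">LORIA</a>
  <a name="top">no href here</a>
</div>
"""
BASE_URL = "http://synalp.loria.fr/index.html"


def extract_links(html, base_url):
    """Return (text, absolute_url) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), urljoin(base_url, a["href"]))
        for a in soup.find_all("a", href=True)
    ]


for text, url in extract_links(HTML, BASE_URL):
    print(f"{text}: {url}")
```

The `<a>` tag without an `href` is skipped by the `href=True` filter, and absolute URLs pass through `urljoin` unchanged.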

### Scrapy  
[video 1](https://www.youtube.com/watch?v=s4jtkzHhLzY&list=PLRzwgpycm-Fjvdf7RpmxnPMyJ80RecJjv&index=2)  
[video 2](https://www.youtube.com/watch?v=m_3gjHGxIJc) 


```
import scrapy

webpage = "http://quotes.toscrape.com/"
```

## 3. Web services (to be completed)

## 4. Storing data

It is easier to **query** stored data when we store it in a **database**.
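
As a small illustration of what querying buys you, the sketch below stores a few records in an in-memory SQLite database and filters them with one query. SQLite is used here only as a stand-in because it ships with Python; MongoDB is a document store with a different query API. The records are hypothetical, shaped like scraped (text, URL) link pairs.

```python
import sqlite3

# Hypothetical records, shaped like scraped (text, url) link pairs.
links = [
    ("About", "http://synalp.loria.fr/about/"),
    ("LORIA", "https://www.loria.fr/en"),
    ("Nancy", "http://en.nancy-tourisme.fr"),
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.execute("CREATE TABLE links (text TEXT, url TEXT)")
conn.executemany("INSERT INTO links VALUES (?, ?)", links)

# Query: which links use HTTPS?
https_links = conn.execute(
    "SELECT text FROM links WHERE url LIKE 'https://%' ORDER BY text"
).fetchall()
print(https_links)  # → [('LORIA',)]
conn.close()
```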

### MongoDB

* Run MongoDB: __[instructions](https://www.mongodb.com/docs/manual/tutorial/install-mongodb-on-debian/)__

### Practical Session