# Reading data from CSV files

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('test_file.csv')
df

Unnamed: 0,a,b,c,d
0,yellow,10,2,3.2
1,green,2,3,8.1
2,blue,7,1,0.4


In [3]:
pd.read_csv('test_file.csv',names=['column 1','column 2','column 3','column 4'])

Unnamed: 0,column 1,column 2,column 3,column 4
0,a,b,c,d
1,yellow,10,2,3.2
2,green,2,3,8.1
3,blue,7,1,0.4
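
In the output above, the original header row (a, b, c, d) has become the first data row, because passing `names` on its own tells pandas that the file has no header. Below is a minimal sketch of one way around this: passing `header=0` together with `names` replaces the existing header. The in-memory `csv_text` is a stand-in for `test_file.csv`, and its contents are assumed from the tables shown earlier.

```python
import io
import pandas as pd

# Stand-in for test_file.csv; contents assumed from the output shown above.
csv_text = "a,b,c,d\nyellow,10,2,3.2\ngreen,2,3,8.1\nblue,7,1,0.4\n"

new_names = ['column 1', 'column 2', 'column 3', 'column 4']

# header=0 marks the first line as the header, which `names` then replaces.
renamed = pd.read_csv(io.StringIO(csv_text), header=0, names=new_names)
print(renamed)
```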


In [4]:
pd.read_csv('test_file.csv', index_col=0)

a,b,c,d
yellow,10,2,3.2
green,2,3,8.1
blue,7,1,0.4


In [5]:
df.dtypes

a     object
b      int64
c      int64
d    float64
dtype: object

In [6]:
# Force the dtype of column b

df2 = pd.read_csv('test_file.csv', dtype={'b': np.float64})
df2.dtypes

a     object
b    float64
c      int64
d    float64
dtype: object

#### Loading only some columns

In [7]:
pd.read_csv("test_file.csv", usecols=['a', 'b'])

Unnamed: 0,a,b
0,yellow,10
1,green,2
2,blue,7
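
Partial loading can also restrict rows rather than columns. The sketch below combines `usecols` with `nrows`, a `read_csv` parameter that stops reading after the given number of data rows. It again uses an in-memory stand-in for `test_file.csv`, with contents assumed from the earlier output.

```python
import io
import pandas as pd

# Stand-in for test_file.csv; contents assumed from the output shown earlier.
csv_text = "a,b,c,d\nyellow,10,2,3.2\ngreen,2,3,8.1\nblue,7,1,0.4\n"

# Read only columns a and d, and only the first two data rows.
subset = pd.read_csv(io.StringIO(csv_text), usecols=['a', 'd'], nrows=2)
print(subset)
```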


# Reading data from Excel files

In [8]:
import pandas as pd
pd.read_excel('data.xls')

FileNotFoundError: [Errno 2] No such file or directory: 'data.xls'

In [None]:
pd.read_excel('data.xls', sheet_name='Sheet2')

In [None]:
pd.read_excel('data.xls', sheet_name='Sheet2', usecols = ['varD','varE'])
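
The `FileNotFoundError` shown earlier simply means that `data.xls` is not in the working directory. Here is a small sketch of a guard around `read_excel`. The helper name `load_sheet` is hypothetical, and actually reading a `.xls` file also requires an Excel engine such as `xlrd` to be installed.

```python
from pathlib import Path
import pandas as pd

def load_sheet(path, sheet_name=0, usecols=None):
    """Return the requested sheet as a DataFrame, or None if the file is missing."""
    if not Path(path).exists():
        print(f"File not found: {path}")
        return None
    # sheet_name accepts a sheet label or a zero-based position;
    # usecols limits which columns are read.
    return pd.read_excel(path, sheet_name=sheet_name, usecols=usecols)

result = load_sheet('data.xls', sheet_name='Sheet2', usecols=['varD', 'varE'])
```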

# JSON data

When dealing with data on the web, the most common format we will come across is JSON, which stands for JavaScript Object Notation. In a nutshell, JSON is a text format used to transmit information between web servers and clients (such as browsers) in a logical and structured manner. It was first developed in the early 2000s in response to the need for a better way for servers and browsers to communicate. As its name suggests, it was derived from the JavaScript programming language; however, unlike JavaScript objects, JSON data can be exchanged between different programming languages in a form they can all work with. In fact, almost all programming languages nowadays provide functions or libraries that can read and write JSON data.


#### Syntax and structure
JSON can contain two types of elements:

- JSON objects
- arrays

A JSON object is essentially just a key-value data format that is stored inside curly brackets. Here is an example:

In [None]:
{
  "userID": 12345,
  "userName": "John Smith"
}

An array is an ordered collection that can contain values of different data types. The main syntactic difference between JSON objects and arrays is that arrays are enclosed in square brackets. We can use an array as the value of a key in a JSON object, as shown below:

In [None]:
{
  "userID": 12345,
  "userName": "John Smith",
  "results": [
    {
      "test": "Verbal Reasoning",
      "score": 140
    },
    {
      "test": "Quantitative Reasoning",
      "score": 165
    },
    {
      "test": "Analytical Writing",
      "score": 5
    }
  ],
  "testCompleted": true
}
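
To see how these structures map onto Python, here is a short sketch using the standard `json` module on a shortened copy of the record above. JSON objects become dictionaries and arrays become lists. JSON itself writes booleans in lowercase (`true`, `false`), and these are converted to Python's `True` and `False` on parsing.

```python
import json

# Shortened copy of the example record, as JSON text.
json_text = '''
{
  "userID": 12345,
  "userName": "John Smith",
  "results": [
    {"test": "Verbal Reasoning", "score": 140},
    {"test": "Quantitative Reasoning", "score": 165}
  ],
  "testCompleted": true
}
'''

record = json.loads(json_text)

# The top-level object is a dict; "results" is a list of dicts.
print(type(record).__name__)
print(record["results"][1]["score"])
print(record["testCompleted"])
```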

In [None]:
{
    "col1":
        {
            "row1":0,"row2":4,"row3":8,"row4":12
        },
    "col2":
        {
            "row1":1,"row2":5,"row3":9,"row4":13
        },
    "col3":
        {
            "row1":2,"row2":6,"row3":10,"row4":14
        },
    "col4":
        {
            "row1":3,"row2":7,"row3":11,"row4":15
        }
}

In [None]:
pd.read_json('frame.json')

Now in this example, the JSON file that we used was already in what is called tabular form. This means that we could directly load it as a DataFrame. However, this is not usually the case with JSON files.
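
As a self-contained sketch of the tabular case, the snippet below reads the same column-oriented JSON from an in-memory string rather than from disk. It assumes that `frame.json` contains the JSON shown above. Each top-level key becomes a column and each inner key becomes a row label.

```python
import io
import pandas as pd

# Assumed contents of frame.json, copied from the example above.
frame_json = '''
{
  "col1": {"row1": 0, "row2": 4, "row3": 8,  "row4": 12},
  "col2": {"row1": 1, "row2": 5, "row3": 9,  "row4": 13},
  "col3": {"row1": 2, "row2": 6, "row3": 10, "row4": 14},
  "col4": {"row1": 3, "row2": 7, "row3": 11, "row4": 15}
}
'''

# read_json accepts a file-like object as well as a path.
frame = pd.read_json(io.StringIO(frame_json))
print(frame)
```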



In [None]:
pd.read_json('books.json')

We can see that the data in this file is no longer tabular. If we try to load it directly with read_json(), as above, the result is not very useful.

To turn this data into a DataFrame, we need to perform some additional steps. We start by importing Python's json library and a pandas function called json_normalize():

In [None]:
import json
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0

Let's load the data from the JSON file and convert it to an object (a dictionary, really) which we store in the variable dictionary:

In [None]:
with open('books.json', 'r') as f:
    json_string = f.read()
    dictionary = json.loads(json_string)

If we now evaluate dictionary, we can indeed see the data from our file. Once the data is in this form, we can apply a process known as normalization, so called because it "normalizes" JSON data, which can be quite complex in structure, into a flat table (a DataFrame, to be more precise). To do this, we use the json_normalize() function, which turns an array of nested JSON objects into a DataFrame whose columns correspond to the variables stored in the JSON file. We pass it two arguments: the variable dictionary, which holds the data, and the key under which the individual entries are stored. To find this key, we look at the JSON file and check the name that precedes the list of entries. In our case, this name is books. Let's try this out:

In [None]:
json_normalize(dictionary, 'books')
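
Since the contents of `books.json` are not reproduced here, the following sketch uses a small hypothetical dictionary with the same overall shape (a `books` key holding a list of entries) to show what the call does. The titles, authors and years are invented for illustration.

```python
import pandas as pd

# Hypothetical structure; the real books.json may contain different fields.
dictionary = {
    "books": [
        {"title": "Book A", "author": "Author One", "year": 2001},
        {"title": "Book B", "author": "Author Two", "year": 2015},
    ]
}

# Each object in the list under the "books" key becomes one row.
books = pd.json_normalize(dictionary, record_path='books')
print(books)
```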

# HTML files

The web is one of the major sources of data that you will encounter. Getting data from the web is known as **web scraping**, and it is a very useful skill in any data scientist's toolbox. It lets us collect data that is not available in a well-structured, directly downloadable format such as CSV. **You might wonder, why don't we just copy and paste the data manually? Well, this might work for a small webpage, but in general we will be interested in scraping large amounts of data, which would be extremely time-consuming and completely impractical to do by hand**. Luckily, Python has several tools that help automate this process for us.

Before we get into it, a word of warning: be cautious when crawling the web. In particular, a website's Terms of Service may explicitly prohibit scraping, and the data itself may be copyrighted. So be sure to understand what you're doing (here is an interesting analysis of the problem).

#### What exactly is HTML?

We will not get into too much detail here about HTML, the HyperText Markup Language that powers the web, but we will cover some very basic facts that will be sufficient for you to perform successful web scraping. **HTML is the source code that generates a webpage.** When viewing a webpage in our web browser, we can look at its source code by right-clicking and selecting view page source or show page source, depending on the browser we are using. Here is an example:

We will exploit these patterns to retrieve the information that we want. We will be especially interested in the attributes **class and id**. These are special properties that give HTML elements names, and we can take advantage of these names when web scraping. An element can have multiple classes but only one id. However, HTML does not require elements to have classes or ids, so not every web page will have these attributes.
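
To make the role of class and id concrete, here is a small sketch that lists them for each element of a made-up HTML snippet, using only Python's standard `html.parser` module. The snippet and the `AttrCollector` class are hypothetical and serve only as an illustration. Note how a single `class` attribute can hold several space-separated class names.

```python
from html.parser import HTMLParser

# Hypothetical HTML snippet, for illustration only.
html_doc = '''
<div id="main" class="content wide">
  <p class="quote">First quote</p>
  <p class="quote short">Second quote</p>
  <p>No attributes here</p>
</div>
'''

class AttrCollector(HTMLParser):
    """Record the tag name, id and list of classes of every opening tag."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A class attribute may contain several space-separated names.
        classes = (attrs.get('class') or '').split()
        self.elements.append((tag, attrs.get('id'), classes))

collector = AttrCollector()
collector.feed(html_doc)
for tag, element_id, classes in collector.elements:
    print(tag, element_id, classes)
```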

#### The requests library

The first step in web scraping is to read the web page into Python. This is done using the requests library, so we first have to import it:

In [9]:
import requests

In [10]:
page=requests.get('https://web.archive.org/web/20180908144902/http://en.proverbia.net/shortfamousquotes.asp')

In [11]:
page.status_code

200

In [13]:
page.text[0:100]

'\n<!DOCTYPE html>\n\n<html lang="en" xml:lang="en">\n<head><script src="//archive.org/includes/analytics'
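
A status code of 200 means the request succeeded, but requests can also fail or return an error status. The sketch below wraps the steps above in a small helper that handles both cases. The function name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails or returns an error status."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raises an HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.text
```

Given a working network connection, calling `fetch_html` with the archive URL used above should return the same text as `page.text`.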

# Web scraping