# Chapter 2 Import a Dataset

- [2.1 Python Modules & Import Data From Modules](#2.1)
    - [2.1.1 pip install modulename](#2.1.1)
    - [2.1.2 A soft introduction to object-oriented programming](#2.1.2)
- [2.2 Import Data From Different Data Files](#2.2)
    - [2.2.1 Read .csv file](#2.2.1)
    - [2.2.2 Read .txt file](#2.2.2)
- [2.3 Import Data From API](#2.3)

<div id="2.1"> </div>

## 2.1 Python Modules & Import Data From Modules

*Modules (a.k.a. Packages)* are *Python* files that contain functions and variables. You can import these modules and reuse their code to solve your problem.

One advantage of the *Anaconda* distribution of *Python* is that it already comes with many pre-installed modules, so we do not need to spend time downloading and managing these files. However, if you want to add a new package to the root environment, you can use either the *pip* or the *conda* command-line tool that comes with *Anaconda*.

<div id="2.1.1"> </div>

### 2.1.1 pip install modulename

Open a **Terminal** in JupyterLab by clicking the <kbd>+</kbd> button in the upper-right corner of the screen. This step is the same as creating a new notebook. Once the *Launcher* window opens, find *Terminal* under the *Others* section.


> ![Create a terminal](images/chapter2/Create_A_Terminal.png)


In the terminal, type `pip install modulename` or `conda install modulename` to install that module to the root environment of Anaconda. The main difference between *pip* and *conda* is that they download modules from different online repositories: *pip* uses the Python Package Index (PyPI), while *conda* uses Anaconda's own channels.

Let's install the **wooldridge** package to the default environment. Type the following command in the Terminal:

`pip install wooldridge` 

Now, we can import this module. Recall how to import a module.

In [24]:
import wooldridge as woo

> *Don't forget to execute the code by pressing <kbd>Shift</kbd>+<kbd>return</kbd>.*

Coding is never as intuitive as a graphical user interface. We do not have drop-down menus or labeled buttons. Instead, we need to go the old-fashioned way: reading a manual (a.k.a. **API** or **documentation**). Google "python wooldridge" to find the [wooldridge documentation](https://pypi.org/project/wooldridge/). It explains how to use this module.
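
If the import fails with a `ModuleNotFoundError`, the package is usually not installed in the environment your notebook is running in. As a minimal sketch, the standard-library function `importlib.util.find_spec` can check for a module without importing it. The helper name `is_installed` is made up for this illustration.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if Python can locate the module in the current environment."""
    return find_spec(module_name) is not None


# "json" ships with Python itself, so this check should succeed
print(is_installed("json"))

# True only if `pip install wooldridge` worked in this environment
print(is_installed("wooldridge"))
```

If the second line prints `False`, go back to the Terminal and run the install command again.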

In [25]:
# Here I want to show you another way to add comments to your code.
# Instead of using the Markdown mode in JupyterLab, anything after a # is treated as a comment by Python

# import the dataset called 'wage1' and assign it to a variable called wage1
wage1 = woo.data("wage1")

# get the type of this object
print(type(wage1))

<class 'pandas.core.frame.DataFrame'>


<div id="2.1.2"> </div>

### 2.1.2 A soft introduction to object-oriented programming

Python is an object-oriented programming language, which means our code is organized around **objects**. An object in Python is similar to a real-world object. For example, this notebook is an object, the blackboard is an object, and your computer is an object. Similarly, the wage1 dataset is an object, the number "1" is an object, and the "Hello World" string is an object. We can categorize objects into different **classes**, so that objects in the same class share common features. The documentation of a class describes the **attributes** (properties such as name, length, size) and **methods** (what the object can do, such as go(), turn(), move()) of all objects that belong to that class.
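
To make classes, attributes, and methods concrete, here is a minimal sketch. The `Car` class, its `name` and `position` attributes, and its `move()` method are all hypothetical and exist only for this illustration.

```python
class Car:
    """A hypothetical class: every Car object has the same attributes and methods."""

    def __init__(self, name):
        # attributes: properties stored on each object
        self.name = name
        self.position = 0

    def move(self, steps):
        # method: something the object can do
        self.position = self.position + steps
        return self.position


my_car = Car("Model T")   # create an object that belongs to the Car class
print(type(my_car))       # the class of the object
print(my_car.name)        # read an attribute
print(my_car.move(3))     # call a method

# built-in values are objects too
print(type(1))
print(type("Hello World"))
```

Attributes are read without parentheses (`my_car.name`), while methods are called with parentheses (`my_car.move(3)`).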

We noticed that *wage1* is a pandas DataFrame (i.e. this object belongs to the DataFrame class). To see what it can do, we need to look up the [pandas DataFrame documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

Try to locate the head() method. 

In the documentation, we find that the head method takes one *parameter* - n, and returns the same type as the caller -  a DataFrame.


> ![DataFrame.head](images/chapter2/DataFrame.head.png)


To use this parameter, we can either write out the full assignment `wage1.head(n=5)` or ignore the parameter name and the = sign.

In [26]:
wage1.head(5)

Unnamed: 0,wage,educ,exper,tenure,nonwhite,female,married,numdep,smsa,northcen,...,trcommpu,trade,services,profserv,profocc,clerocc,servocc,lwage,expersq,tenursq
0,3.1,11,2,0,0,1,0,2,1,0,...,0,0,0,0,0,0,0,1.131402,4,0
1,3.24,12,22,2,0,1,1,3,1,0,...,0,0,1,0,0,0,1,1.175573,484,4
2,3.0,11,2,0,0,0,0,2,0,0,...,0,1,0,0,0,0,0,1.098612,4,0
3,6.0,8,44,28,0,0,1,0,1,0,...,0,0,0,0,0,1,0,1.791759,1936,784
4,5.3,12,7,2,0,0,1,1,0,0,...,0,0,0,0,0,0,0,1.667707,49,4


This method returns the first "5" (the value of the parameter) rows of the wage1 dataset. We will continue with descriptive data analysis using pandas DataFrames in the next chapter. For now, let's focus on how to load data from other types of files. (To free up memory on your computer, you can delete this object by typing `del wage1`.)

In [27]:
del wage1
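
The sketch below repeats both ideas on a tiny hand-made DataFrame: naming the parameter and passing it by position give the same result, and a deleted name can no longer be used. The variable `small` is a stand-in invented for this example, and its values are copied from the output above.

```python
import pandas as pd

# a tiny stand-in for wage1, using three rows from the output above
small = pd.DataFrame({"wage": [3.1, 3.24, 3.0], "educ": [11, 12, 11]})

by_name = small.head(n=2)      # parameter name spelled out
by_position = small.head(2)    # same call, parameter given by position
print(by_name.equals(by_position))

del small                      # remove the name `small`
try:
    small.head()
except NameError as err:
    # after `del`, Python no longer knows the name
    print("NameError:", err)
```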

<div id="2.2"> </div>

## 2.2 Import Data From Different Data Files

Besides the example datasets from the wooldridge module, we encounter many other data files in our daily workflow. Common file name extensions for these data files are *RAW*, *CSV*, and *TXT*. Knowing how to import these data files is a critical skill for a data analyst.

<div id="2.2.1"> </div>

### 2.2.1 Read .csv file

Fortunately, the **pandas** module provides functions for importing these files. (Try googling "pandas read csv" to find [its documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).)


---

> <img src="images/chapter2/Read_CSV.png" alt="pandas read_csv" width="60%">

---

The pandas.read_csv() *function* takes more than one **parameter**. Parameters in a **Python function** are used to set options and configure the function. If a function takes more than one parameter, they should be separated by a <kbd>,</kbd>.
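
As a small sketch of how comma-separated parameters work, the hypothetical function `describe_read` below only reports the options it receives. It is not part of pandas. Its parameter names are borrowed from `read_csv` purely for illustration.

```python
def describe_read(filepath, header="infer", usecols=None):
    """Hypothetical function that only reports the options it received."""
    return f"file={filepath}, header={header}, usecols={usecols}"


# parameters are separated by commas; the ones we leave out keep their default value
print(describe_read("data/wage1.csv"))
print(describe_read("data/wage1.csv", usecols=["wage", "educ"]))
```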

Let's read in the wage1.csv file:
1. from the path `"data/wage1.csv"`,
2. with the header line,
3. reading only the "wage", "educ", and "exper" columns,
4. and assign this *object* to a variable called wage1.

In [28]:
wage1 = pandas.read_csv("data/wage1.csv")

NameError: name 'pandas' is not defined

Oops, another *NameError*! To use the pandas module, we need to import it first.

In [None]:
import pandas as pd

wage1 = pd.read_csv("data/wage1.csv", header = "infer", usecols=["wage","educ","exper"]) 

"""
Another way to leave comments! Triple quotation marks give you a comment block!

1. Note the first parameter "filepath_or_buffer=" is not spelled out. When a parameter is provided 
in the documented order, the name of the parameter and the = sign are optional.

2. Learn a new datatype (a built-in class): the value passed to usecols is a list.
"""

wage1.head(5)
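
If you do not have `data/wage1.csv` at hand, the same call can be tried on an in-memory file. This is a minimal sketch: the text below stands in for the real file, with three rows and four columns copied from the wage1 output shown earlier, and the variable names `csv_text` and `sample` are made up for the example.

```python
import io

import pandas as pd

# in-memory stand-in for data/wage1.csv
csv_text = """wage,educ,exper,tenure
3.1,11,2,0
3.24,12,22,2
3.0,11,2,0
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    header="infer",                      # take column names from the first line
    usecols=["wage", "educ", "exper"],   # a list of the columns to keep
)

print(type(["wage", "educ", "exper"]))   # usecols is given as a list
print(list(sample.columns))              # tenure was not read in
```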

<div id="2.2.2"> </div>

### 2.2.2 Read .txt file

Let's import another .txt data file. Before importing, you can navigate to the `data/wine.txt` file in the left pane and double click to preview the file in JupyterLab.

From the preview, we find that this data file provides headers in the first row and indexes in the first column. A first guess is to assign 1 to both the header parameter and the index_col parameter. We do not quote 1 because the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_table.html) states that these two parameters accept an **int** datatype instead of the **str** datatype. (Built-in classes are called datatypes.)

In [29]:
# import txt with pandas:

wine = pd.read_table("data/wine.txt", sep="\t", header = 1, index_col = 1)

wine.head()

Unnamed: 0_level_0,0,2.5,785,211,15.300000190734863
Australia,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Austria,1,3.9,863,167,45.599998
Belg/Lux,2,2.9,883,131,20.700001
Canada,3,2.4,793,191,16.4
Denmark,4,2.9,971,220,23.9
Finland,5,0.8,970,297,19.0

This is not what we wanted: the Australia row was used as the header, and the country names became the index. Python counts positions from 0, so the first row is row 0 and the first column is column 0. Passing `header = 0, index_col = 0` instead reads the headers from the first row and the index from the first column.
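
A minimal sketch of zero-based `header` and `index_col`, using an in-memory, tab-separated sample instead of the real file. The column names are assumptions made for this illustration; the values are taken from the output above. Here `header=0` points at the first row and `index_col=0` at the first column.

```python
import io

import pandas as pd

# in-memory stand-in for data/wine.txt (column names are assumed)
txt = (
    "\tcountry\talcohol\tdeaths\n"
    "0\tAustralia\t2.5\t785\n"
    "1\tAustria\t3.9\t863\n"
    "2\tBelg/Lux\t2.9\t883\n"
)

# header=0: column names are in the first row
# index_col=0: the index is in the first column
sample = pd.read_table(io.StringIO(txt), sep="\t", header=0, index_col=0)

print(sample)
```

With these settings Australia stays in the data as the first row instead of being used as the header.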


<div id="2.3"> </div>

## 2.3 Import Data From API

We can also load data that is not stored locally on your computer. To do this, we ask *Python* to query an online database through its **Application Programming Interface (API)**, which is just a "menu" from the data provider that describes what data they have and how to retrieve it.

A module called **pandas_datareader** makes it straightforward to query many online data sources. It is not part of the Anaconda distribution, so we need to install it first.

Open up a Terminal, and type `pip install pandas-datareader`. Your computer will download this module from PyPI and save it in the Anaconda root environment.

The following script demonstrates the workflow of importing stock data for Ford Motor Company. Read this [documentation](https://pandas-datareader.readthedocs.io/en/latest/remote_data.html#fred) to learn more about how to use pandas-datareader.

In [30]:
import pandas_datareader as pdr
from datetime import date

# create some variables to store the information we need

ticker = ["F"]
start_date = "2021-1-1"
end_date = date.today() # the date module helps us find the current date

# import data from Yahoo Finance
F_data = pdr.data.DataReader(ticker, "yahoo", start_date, end_date)

F_data.head()

Attributes,Adj Close,Close,High,Low,Open,Volume
Symbols,F,F,F,F,F,F
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2021-01-04,8.52,8.52,8.84,8.43,8.81,85043100
2021-01-05,8.65,8.65,8.72,8.46,8.47,70127800
2021-01-06,8.84,8.84,8.94,8.68,8.79,72590200
2021-01-07,9.06,9.06,9.08,8.88,8.94,77117100
2021-01-08,9.0,9.0,9.14,8.89,9.1,59162200
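
The columns of `F_data` have two levels, *Attributes* and *Symbols*, so a single column is identified by a pair such as `("Close", "F")`. The sketch below rebuilds two of the columns by hand, using the first three rows shown above, so that it runs without a download. The variable names `sample` and `close` are made up for the example.

```python
import pandas as pd

# rebuild two of the columns shown above by hand (no download needed)
columns = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["F"]], names=["Attributes", "Symbols"]
)
dates = pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"])
sample = pd.DataFrame(
    [[8.52, 85043100], [8.65, 70127800], [8.84, 72590200]],
    index=dates,
    columns=columns,
)
sample.index.name = "Date"

# pick one column with an (attribute, symbol) pair
close = sample[("Close", "F")]
print(close)
```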


In [31]:
# Or take a look at the last 5 rows

F_data.tail()

Attributes,Adj Close,Close,High,Low,Open,Volume
Symbols,F,F,F,F,F,F
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2021-04-20,11.45,11.45,12.06,11.35,12.06,83170800
2021-04-21,11.73,11.73,11.74,11.18,11.36,49641100
2021-04-22,11.94,11.94,12.15,11.83,12.06,73064400
2021-04-23,12.22,12.22,12.24,11.87,11.97,51833300
2021-04-26,12.27,12.27,12.4414,12.23,12.28,39977839
