## I. Introduction to Python

**Econometrics** is a powerful and essential tool for solving real-world problems. Implementing and applying econometric methods and tools in practice helps tremendously with understanding the concepts. Nowadays, a vast majority of people will have to deal with some sort of data analysis in their career. Learning how to use serious data analysis software is an invaluable asset for anyone in economics, business administration, and related fields.

Choosing a software package for learning econometrics can be difficult. Possibly the most important criterion is that the software is widely used both in and outside academia. A large and active user community helps the software remain up to date and increases the chances that somebody else has already solved the problem at hand. **Python** can be an ideal candidate for starting to learn econometrics and data analysis. It has a huge user base, especially in the fields of data science, machine learning, and artificial intelligence, where it is arguably the most popular software overall. Also, Python is completely free and available for all relevant operating systems.

In this section, we provide a gentle introduction to Python, cover some of the basics of the software, demonstrate them with examples, and provide exercises for practice.

### Topics:

1. Working Directory
2. Python Objects
3. Modules
4. External Data

### 1. Working Directory

As with many statistical software packages, when we work on a particular project we need to interact with different files: importing or exporting a data file, saving a generated figure as a graphics file, or storing regression tables as a text, spreadsheet, or LaTeX file. Whenever we provide Python with a file name, it can include the full path on the computer. The full (i.e. "absolute") path to a script file might look like this on a Mac or Linux system:

```
/Users/Econometric-with-Python/Introduction_to_Python.ipynb
```

The path is written with forward slashes, as is standard on Unix-based operating systems. Windows users usually use backslashes instead of forward slashes, but the Unix style also works in Python. On a Windows system, a valid path would be

```
C:/Users/MyUserName/Desktop/Econometric-with-Python/Introduction_to_Python.ipynb
```

If we do not provide any path, Python will use the current "working directory" for reading or writing files. After importing the module **os**, the working directory can be obtained with the command *os.getcwd()*. To change the working directory, use the command *os.chdir(path)*. Relative paths are interpreted relative to the current working directory. For a neat file organization, best practice is to create a directory for each project (say *MyEconProject*) with several sub-directories (say *PyScripts*, *data*, and *images*). At the beginning of the script, we can use the command *os.chdir()* to set the working directory for the project and afterwards refer to a data set in the respective sub-directory as **data/MyData.csv** and to a graphics file as **images/MyFigure.png**.

Here is an example:

``` Python
# Load the os module
import os

# Check the current working directory
print(os.getcwd())

# Run only the os.chdir() line that matches your operating system.
# Note: "MyUserName" should be the actual username on the machine.

# Change the working directory to the desktop (Windows)
os.chdir('C:/Users/MyUserName/Desktop')

# Change the working directory to the desktop (Mac);
# on Linux, home directories usually live under /home instead of /Users
os.chdir('/Users/MyUserName/Desktop')

# Check to see if the working directory has changed
print(os.getcwd())
```
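
The project layout described above can also be sketched with relative paths. The snippet below is only an illustration under stated assumptions: it builds the example project inside a temporary directory so that nothing on your machine is changed, and the names *MyEconProject*, *data*, and *MyData.csv* are simply the placeholders used in the text.

``` Python
import os
import tempfile

# Remember where we started so we can return afterwards
original_dir = os.getcwd()

# Create a throwaway project folder with a "data" sub-directory
# (a temporary location is an assumption made for this sketch)
project_dir = os.path.join(tempfile.mkdtemp(), 'MyEconProject')
os.makedirs(os.path.join(project_dir, 'data'))

# Set the working directory for the project
os.chdir(project_dir)

# A relative path is interpreted relative to the working directory
relative_path = os.path.join('data', 'MyData.csv')
with open(relative_path, 'w') as f:
    f.write('x,y\n1,2\n')

# Convert the relative path to the absolute path Python actually uses
absolute_path = os.path.abspath(relative_path)
print(f'Relative path: {relative_path}')
print(f'Absolute path: {absolute_path}')
print(f'File exists: {os.path.exists(relative_path)}')

# Change back to the original working directory
os.chdir(original_dir)
```

Using *os.path.join* instead of typing slashes by hand lets the same script run on Windows, Mac, and Linux.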

#### Practice:

Check your current working directory using the command *os.getcwd()*. Then change the working directory to **Desktop** using the command *os.chdir()* and check whether the working directory has changed. Finally, use *os.chdir()* again to change back to the original directory and check once more.

In [None]:
# Loading the os Module


In [None]:
# Check your current working directory


In [None]:
# Change the working directory to the Desktop


In [None]:
# Check to see if the directory has changed


In [None]:
# Change the directory back to the original one


In [None]:
# Check again to see if you are in the original directory


### 2. Python Objects

Python supports **object-oriented programming (OOP)**, which relies on the concept of classes and objects. OOP structures a program into simple, reusable blueprints (usually called classes), which are used to create individual objects (instances). Python can work with numbers, lists, arrays, texts, data sets, graphs, functions, and many other types of objects. This section covers the most important ones that we will frequently encounter in econometric analysis. We begin with the built-in objects that are available with the standard distribution of Python, then introduce objects included in the modules **numpy** and **pandas**.

#### Variables

As in many statistical software packages, we often want to store the results of calculations to reuse them later. For this, we can assign a result to a **variable**. A variable has a name, and by this name we can access the assigned object.

Here are some examples:

``` Python
# Assigning a value 5 to a variable x
x = 5
print(f'x has a value of: {x}')

# Assigning a value 10 to a variable y
y = 10
print(f'y has a value of: {y}')

# Assigning the value y divided by x to a variable z
z = y / x
print(f'z has a value of: {z}')
```

#### Practice:

Try to complete the following assignments and print their values.

- Assign 3 + 4 to variable "a"
- Assign 3.14 to variable "b"
- Assign "Hello World" to variable "c"

In [None]:
# Assign 3 + 4 to variable "a"


In [None]:
# Assign 3.14 to variable "b"


In [None]:
# Assign "Hello World" to variable "c"


#### Objects in Python

Once you have assigned different values to variables, you might wonder what kinds of objects we have dealt with so far. We can use the command **type** to identify the object type.

Here are some examples:

``` Python
# Assigning a value 5 to a variable x
x = 5
x_type = type(x)
print(f'x is a: {x_type}')

# Assigning a value 2.5 to a variable y
y = 2.5
y_type = type(y)
print(f'y is a: {y_type}')

# Assigning the value "Python" to a variable z
z = "Python"
z_type = type(z)
print(f'z is a: {z_type}')
```

The command **type** tells us that we have created an integer (**int**), a floating point number (**float**), and a text object or string (**str**). The data type defines not only what values can be stored, but also which actions can be performed on these objects. For example, if we try to add an integer to a string, Python will return:

```
TypeError: unsupported operand type(s) for +: 'int' and 'str'
```
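
A short sketch of how this error arises and how an explicit conversion avoids it. The variable names are made up for illustration; *int()* and *str()* are Python's built-in conversion functions.

``` Python
# An integer and a string cannot be added directly
number = 5
text = '10'

try:
    result = number + text
except TypeError as err:
    print(f'TypeError: {err}')

# Convert explicitly so that both operands have the same type
as_number = number + int(text)   # numeric addition
as_text = str(number) + text     # string concatenation

print(f'As number: {as_number}')
print(f'As text: {as_text}')
```

Which conversion is appropriate depends on whether the values are meant as numbers or as text.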

#### Practice:

Try to use the **type** command to identify the data type of the following values,

- 987654321
- 3.1415926535
- "Hello World"

In [None]:
# Enter your code here!


In [None]:
# Enter your code here!


In [None]:
# Enter your code here!


Scalar data types like *int*, *float*, or *str* contain only a single value. A **Boolean** value, also called a **logical** value, is another scalar data type that becomes useful if you want to execute code only if one or more conditions are met. An object of type *bool* can take only one of two values: **True** or **False**. The easiest way to generate them is to state claims that are either true or false and let Python decide. **Table 1.1** lists the main comparison and logical operators:

#### Logical Operators

**Table 1.1**

|  Operator  |  Description  |  Syntax  |
|  :---:  |  :---  |  :---:  |
|  ==  |  x is equal to y  |  x == y  |
|  <  |  x is less than y  |  x < y  |
|  <=  |  x is less than or equal to y  |  x <= y  |
|  >  |  x is greater than y  |  x > y  |
|  >=  |  x is greater than or equal to y  |  x >= y  |
|  !=  |  x is NOT equal to y  |  x != y  |
|  not  |  NOT b (i.e. True if b is False)  |  not b  |
|  or  |  either a or b is True (or both)  |  a or b  |
|  and  |  both a and b are True  |  a and b  |
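
The operators in Table 1.1 can be tried out with a few lines of code. This is a minimal sketch with arbitrarily chosen values for x and y; the variable names are illustrative only.

``` Python
x = 5
y = 10

# Comparisons return Boolean values
is_equal = (x == y)
is_smaller = (x < y)
is_different = (x != y)
print(f'x == y: {is_equal}')
print(f'x < y: {is_smaller}')
print(f'x != y: {is_different}')

# Boolean values can be combined with and, or, and not
both = is_smaller and is_different
either = is_equal or is_smaller
negated = not is_equal
print(f'both: {both}, either: {either}, negated: {negated}')

# Execute code only if a condition is met
if x < y and not is_equal:
    message = 'x is strictly smaller than y'
else:
    message = 'x is not smaller than y'
print(message)
```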

As we saw in previous examples, scalar types differ in what kind of data they can be used for:

- int: whole numbers, for example 5 or 10000
- float: numbers with a decimal point, for example 2.25 or 12345.00
- str: any sequence of characters delimited by either single or double quotes, for example 'python' or "Hello World"
- bool: either **True** or **False**

#### Collection of Objects

For statistical calculations, we often need to work with data sets that include many numbers or texts instead of scalars. The simplest way to collect components (even components of different types) is called a **list** in Python terminology; it plays a role similar to a **vector** in R. To define a **list**, we collect different values in square brackets: [value1, value2, ...]. We can access a list entry by providing its position (starting at 0) within square brackets next to the variable name referencing the list. We can also access a range of values from start position *i* to end position *j* with the syntax listname[i:(j+1)].

Here are some examples:

``` Python
# Assign a list of letters 'a' to 'f' to a variable "letters"
letters = ['a', 'b', 'c', 'd', 'e', 'f']

# Access the letter 'a' in the list
letters[0]

# Access the letter 'd' in the list
letters[3]

# Access the letters 'd' to 'f' in the list
letters[3:6]
```
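
Two points from the text are worth trying out: a list can mix types, and the end position of a range is excluded. The sketch below uses made-up values and also shows that list entries can be changed after creation.

``` Python
# A list may hold components of different types
mixed = ['a', 2, 3.5, True]
types = [type(item).__name__ for item in mixed]
print(f'Types in mixed: {types}')

letters = ['a', 'b', 'c', 'd', 'e', 'f']

# Slicing excludes the end position, so [1:4] returns positions 1, 2, and 3
middle = letters[1:4]
print(f'letters[1:4]: {middle}')

# Lists can be modified in place
letters[0] = 'A'        # replace the first entry
letters.append('g')     # add a new entry at the end
print(f'Modified list: {letters}')
print(f'Number of entries: {len(letters)}')
```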

A key characteristic of a **list** is the order of its components. The order allows us to access components by position. **Dictionaries** (dict), on the other hand, store components under unique **keys**, and we access components by key rather than by position.

Here are some examples:

``` Python
# Define and print a dict:
x = ['Tom', 'Peter', 'Nancy']
y = [19, 24, 20]
z = [False, True, True]
person_dict1 = dict(name = x, age = y, college = z)
print(f'person_dict1: \n{person_dict1}\n')

# Another way to define the dict:
person_dict2 = {'name': x, 'age': y, 'college': z}
print(f'person_dict2: \n{person_dict2}\n')

# Check data type:
print(f'data type: {type(person_dict1)}\n')

# Access 'age':
ages = person_dict1['age']
print(f'ages: {ages}\n')

# Access 'age' of Peter
peter_age = person_dict1['age'][1]
print(f"Peter's age: {peter_age}\n")

# Add 2 years to Peter's age and change his college status
person_dict1['age'][1] = person_dict1['age'][1] + 2
person_dict1['college'][1] = False
print(f'person_dict1: \n{person_dict1}\n')

# Add a new variable 'income':
person_dict1['income'] = [100, 250, 200]

# Delete variable 'age':
del person_dict1['age']
print(f'person_dict1: \n{person_dict1}\n')
```

There are many more important data types; we covered only the ones that are most relevant for the topics discussed in this document. **Table 1.2** summarizes these built-in data types with a simple example of each in case you have to look them up later.

**Table 1.2**

|  Python Type  |  Data Type  |  Example  |
|  :---  |  :---  |  :---  |
|  int   |  Integer  |  x = 10  |
|  float |  Floating Point Number  |  x = 3.14  |
|  str   |  String  |  x = 'hello world'  |
|  bool  |  Boolean  |  x = True  |
|  list  |  List  |  x = [1, 3, 5, 7, 9]  |
|  dict  |  Dict  |  x = {'A': [1, 2, 3], 'B': ['a', 'b', 'c']}  |

#### Practice:

1. Create the following two lists and assign them to the variables x and y, respectively.

Variable: x
Values: '1/1/2020', '1/2/2020', '1/3/2020'

Variable: y
Values: 62, 67, 58

2. Create a dictionary with the keys "Date" and "Temp" for x and y.

3. Print the temperature on 1/2/2020.

4. Change the temperature on 1/3/2020 to 60, then print the dict to check that the value has been updated.


In [None]:
# Create the lists here.


In [None]:
# Create the dict here.


In [None]:
# What is the temperature on 1/2/2020?


In [None]:
# Change the temperature on 1/3/2020 to 60


### 3. Modules

**Modules** are Python files that contain functions and variables. We can access these modules and make use of their code to solve different problems. The standard distribution of Python already comes with a number of built-in modules. To make use of their commands we have to import these modules first.

Here, we demonstrate how to import the **math** module under an alias. We can choose whatever alias we want, but usually aliases follow a naming convention, which can be found in the documentation of the module. After the import, functions and variables are accessed with the dot (.) syntax, which is related to the concept of object orientation.

Here is an example:

``` Python
# Import the math module as an alias "m"
import math as m

# Using the square root function in math module
x = m.sqrt(16)
print(f'Square Root of 16 is: {x}\n')

# Using the pi variable in math module
y = m.pi
print(f'The value of pi is: {y}\n')

# Using Euler's number in math module
z = m.e
print(f"Euler's Number: {z}\n")
```

The functionality of Python can also be extended relatively easily by advanced users. This is useful not only to those who are able and willing to do so, but also to novice users, who can easily make use of a wealth of extensions created by a big and active community. Since these extensions are mostly programmed in Python, everybody can check and improve the code submitted by a user, so quality control works very well. The Anaconda distribution of Python already comes with a number of external modules, also called *packages*, that we need for data analysis.

On top of the packages that come with the standard installation or Anaconda, there are countless packages available for download. If they meet certain quality criteria, they can be published on the official "Python Package Index" (PyPI) servers at https://pypi.org/. Downloading and installing these packages is simple: we type "pip install modulename" on the command line, or run the same command from within a notebook. Of course, the installation only has to be done once per machine / user and needs an active internet connection.

Here is an example of installing a package from a Jupyter Notebook:

``` Python
# Install wooldridge package
!pip install wooldridge

# Import package from library
import wooldridge
```
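
Before importing, it can be handy to check whether a module is available at all. The sketch below uses *importlib.util.find_spec* from the standard library; the helper name *is_installed* is made up for this illustration and is not part of any package.

``` Python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate the module (hypothetical helper)."""
    return importlib.util.find_spec(module_name) is not None

# Report the status of a few modules used in this document
for name in ['math', 'numpy', 'wooldridge']:
    if is_installed(name):
        print(f'{name}: available')
    else:
        print(f'{name}: missing (try "pip install {name}")')
```

The output depends on what is installed on your machine, so it may differ from one computer to the next.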

#### Practice:

Try to install the following packages using the pip install command.

- wooldridge (Data sets from Introductory Econometrics: A Modern Approach (6th ed, J.M. Wooldridge))
- numpy (NumPy is the fundamental package for array computing with Python)
- pandas (Powerful data structures for data analysis, time series, and statistics)
- pandas_datareader (Data readers extracted from the pandas codebase, should be compatible with recent pandas versions)
- statsmodels (Statistical computations and models for Python)
- matplotlib (Python plotting and data visualization package)
- scipy (SciPy, Scientific library for Python)
- patsy (A Python package for describing statistical models and for building design matrices)
- linearmodels (Instrumental variable and linear panel models for Python)


In [None]:
# Install the packages here


In [None]:
# Import the installed packages


#### Objects from Different Modules

Now that we have installed the packages, we take a brief look at the objects in these modules. We begin with the **numpy** objects.

#### Objects in numpy

Before we start with numpy, make sure that you have the Anaconda distribution or have installed **numpy** as explained in the previous section. For more information about the module, see [Walt, Colbert, and Varoquaux (2011)](https://arxiv.org/pdf/1102.1523.pdf). It is standard to import the module under the alias **np** when working with numpy, so the first line of code always is:

``` Python
import numpy as np
```

The most important data type in **numpy** is the multidimensional array (**ndarray**). We first introduce the definition of this data type as well as the basics of accessing and manipulating arrays. Second, we will demonstrate functions and methods that become useful when working on econometric problems.

To create a simple array, provide a **list** to the function **np.array**. We can also create a two-dimensional array by providing multiple lists within square brackets. Instead of a two-dimensional array, we often call this data type a matrix. Matrices are important tools for econometric analyses.

Note: Appendix D of Wooldridge (2019) introduces the basic concepts of matrix algebra.

The syntax for defining a **numpy** array is:

``` Python
array1D = np.array(list)
array2D = np.array([list1, list2, list3])
```

A **numpy** array requires a homogeneous data type. If we enter lists that include elements of different types, numpy will convert them to one common data type.

For example, *np.array(['a', 2])*, becomes an array of strings.
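
The conversion can be inspected through the *dtype* attribute of an array. The following sketch uses arbitrary example values; the exact integer and string dtypes printed may vary by platform and numpy version, so only the general kind of type is noted in the comments.

``` Python
import numpy as np

# All integers: the array keeps an integer type
ints_only = np.array([1, 2, 3])

# Integers mixed with a float: everything becomes a float
mixed_numbers = np.array([1, 2.5, 3])

# A string mixed with a number: everything becomes a string
mixed_text = np.array(['a', 2])

print(f'ints_only dtype: {ints_only.dtype}')
print(f'mixed_numbers dtype: {mixed_numbers.dtype}')
print(f'mixed_text dtype: {mixed_text.dtype}')
print(f'mixed_text values: {mixed_text}')
```

Note that the number 2 in the last array is stored as the text '2', so arithmetic on it is no longer possible without converting it back.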

Indexing one-dimensional arrays is similar to the procedure with the data type **list**. Two-dimensional arrays are accessed by two comma-separated values within the square brackets. The first number gives the row, the second number gives the column (starting at 0 for the first row or column). Just as with a **list**, accessing ranges of values with ":" excludes the upper limit.

Here are some examples:

``` Python
# Import numpy module as an alias "np"
import numpy as np

# Define an array in numpy:
nparray1D = np.array([100, 3, 32.4, 5.0])
print(f'type(nparray1D): {type(nparray1D)}\n')

# Define a matrix in numpy:
nparray2D = np.array([[1, 2, 3, 4],
                     [10, 20, 30, 40],
                     [3, 6, 9, 12]])

# Get the dimension of nparray2D
dim = nparray2D.shape
print(f'Dimension: {dim}\n')

# Access elements by indices
third_elem = nparray1D[2]
print(f'Third Element: {third_elem}\n')

# Access element in 2nd row and 3rd column in a matrix
second_third_elem = nparray2D[1, 2]
print(f'2nd Row and 3rd Column Element: {second_third_elem}\n')

# Access the 2nd and 3rd column in each row
second_to_third_elem = nparray2D[:, 1:3]
print(f'2nd and 3rd column in each row: \n{second_to_third_elem}\n')

# Access elements by lists
first_third_elem = nparray1D[[0, 2]]
print(f'1st and 3rd elements in an array: {first_third_elem}\n')

# Same with Boolean lists:
first_third_elem2 = nparray1D[[True, False, True, False]]
print(f'1st and 3rd elements: {first_third_elem2}\n')

k = np.array([[True, False, False, False],
              [False, False, True, False],
              [True, False, True, False]])

# 1st element in 1st row, 3rd element in 2nd row, ...
elem_by_index = nparray2D[k] 
print(f'Element by index: {elem_by_index}\n')
```

**numpy** has also some predefined and useful special cases of one and two-dimensional arrays. Here are some examples:

``` Python
# Array of evenly spaced numbers defined by the arguments start, end, and sequence length:
sequence = np.linspace(0, 2, num = 11)
print(f'Sequence: \n{sequence}\n')

# Sequence of integers starting at 0, ending at 5 - 1:
sequence_int = np.arange(5)
print(f'Sequence: \n{sequence_int}\n')

# Initialize array with each element set to zero:
zero_array = np.zeros((4, 3))
print(f'Zero Array: \n{zero_array}\n')

# Uninitialized array (filled with arbitrary nonsense elements):
empty_array = np.empty((2, 3))
print(f'Empty Array: \n{empty_array}\n')
```

**Table 1.3** lists important functions and methods in **numpy**. We can apply them to the data type ndarray, but they usually work for many built-in types too. Functions are often vectorized, meaning that they are applied to each of the elements separately (in a very efficient way). Methods on an object referenced by **x** are invoked with the **x.somemethod()** syntax.

**Table 1.3 Important numpy Functions and Methods**

|  Functions / Methods  |  Description  |
|  :---  |  :---  |
|  add(x, y) or x + y  |  Element-wise sum of x and y  |
|  subtract(x, y) or x - y  |  Element-wise difference of x and y  |
|  divide(x, y) or x / y  |  Element-wise division of x by y  |
|  multiply(x, y) or x * y  |  Element-wise product of x and y  |
|  exp(x)  |  Element-wise exponential of x  |
|  sqrt(x)  |  Element-wise square root of x  |
|  log(x)  |  Element-wise natural logarithm of x  |
|  linalg.inv(x)  |  Inverse of x  |
|  x.sum() |  Sum of all elements in x  |
|  x.min() |  Minimum of all elements in x  |
|  x.max() |  Maximum of all elements in x  |
|  x.dot(y) or x@y |  Matrix multiplication of x and y  |
|  x.transpose() or x.T |  Transpose of x  |
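
A few entries of Table 1.3 in action. This is a small sketch with arbitrary one-dimensional arrays, meant to show the difference between vectorized functions (called as *np.function(x)*) and methods (called as *x.method()*).

``` Python
import numpy as np

x = np.array([1.0, 4.0, 9.0])
y = np.array([1.0, 2.0, 3.0])

# Vectorized functions and operators work element by element
roots = np.sqrt(x)
ratio = x / y
print(f'Square roots of x: {roots}')
print(f'x divided by y: {ratio}')

# Methods summarize the array they are called on
total = x.sum()
smallest = x.min()
largest = x.max()
print(f'Sum: {total}, Min: {smallest}, Max: {largest}')

# For two one-dimensional arrays, @ returns the inner product
inner = x @ y
print(f'Inner product of x and y: {inner}')
```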

**numpy** has a powerful matrix algebra system. Basic matrix algebra includes:

- Matrix addition with the operator +, as long as the matrices have the same dimensions.
- The operator * does not perform matrix multiplication but rather element-wise multiplication.
- Matrix multiplication with the operator **@** (or the **dot** method), as long as the dimensions of the matrices are compatible ("conformable"), which means that the number of columns in the first matrix equals the number of rows in the second matrix.
- Transpose of a matrix **x**: **x.T**
- Inverse of a matrix **x**: **np.linalg.inv(x)**

Here are some examples:

```Python
# Define two 2D matrices in numpy:
matA = np.array([[4, 9, 8],
                 [2, 6, 3]])

matB = np.array([[1, 5, 2],
                 [6, 6, 0],
                 [4, 8, 3]])

# Use a numpy exponential function:
exp_A = np.exp(matA)
print(f'Exponential of Matrix A: \n{exp_A}\n')

# Add the first two rows of Matrix B to Matrix A, element by element
add_A = matA + matB[[0,1]] # same as np.add(matA, matB[[0,1]])
print(f'New Matrix A: \n{add_A}\n')

# Use the transpose method:
matA_tr = matA.transpose()
print(f'Matrix A Transpose: \n{matA_tr}\n')

# Matrix algebra: matrix multiplication
matprod = matA.dot(matB)  # same as matA @ matB
print(f'Multiplication of Matrix A and B: \n{matprod}\n')
```
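
The inverse from the list of matrix operations can be sketched in the same way. The matrix below is an arbitrary invertible 2 x 2 example chosen for illustration; multiplying it by its inverse should reproduce the identity matrix up to rounding error.

``` Python
import numpy as np

# A small square matrix with a nonzero determinant (2*3 - 1*1 = 5)
matC = np.array([[2.0, 1.0],
                 [1.0, 3.0]])

# Inverse of the matrix
matC_inv = np.linalg.inv(matC)
print(f'Inverse of Matrix C: \n{matC_inv}\n')

# A matrix times its inverse gives the identity matrix
identity = matC @ matC_inv
print(f'Matrix C times its inverse: \n{identity}\n')

# Compare with a tolerance because of floating point rounding
is_identity = np.allclose(identity, np.eye(2))
print(f'Equal to the identity matrix: {is_identity}')
```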

#### Practice:

1. Create a 3 x 3 matrix with any numbers and assign it to matA
2. Create a 2 x 3 matrix with any numbers and assign it to matB
3. Multiply matA by matB and see whether it returns an error. If you get an error, why is that?
4. Fix the error by transposing matB, then multiply matA by matB again.

In [None]:
# Create matA and matB


In [None]:
# Multiply matA by matB


In [None]:
# Transpose matB


In [None]:
# Multiply matA by matB again


#### Objects in pandas

The module **pandas** builds on top of the data types introduced in previous sections and allows us to work with something we will encounter almost every time we discuss an econometric application: a **data frame**. A data frame is a structure that collects several variables and can be thought of as a rectangular shape with the rows representing the observational units and the columns representing the variables. A data frame can contain variables of different data types (for example a numerical list, a one-dimensional ndarray, str, and so on). Before we start working with **pandas**, make sure it is installed. The standard alias of this module is **pd**, so when working with **pandas**, the first line of code always is:

``` Python
import pandas as pd
```

The most important data type in pandas is **DataFrame**, which we will often simply refer to as "data frame". One strength of pandas is the existence of a whole set of operations that work on the index of a **DataFrame**. The index contains information on the observational unit, like the person answering a questionnaire or the date of a stock price we want to work with. Accessing elements of a DataFrame object can be done in multiple ways:

- Access columns / variable by name: 
```
df['varname1'] or df[['varname1', 'varname2']]
```
- Access rows / observations by integer position i to j: 
```
df[i:(j+1)] (also works with the index names of df)
```
- Access variables and observations by names: 
```
df.loc['rowname', 'colname']
```
- Access variables and observations by row and column integer position i and j: 
```
df.iloc[i, j]
```

If we define a DataFrame by a combination of several DataFrames, they are automatically matched with their indices.

Here are some examples:

``` Python
import numpy as np
import pandas as pd

# Define a pandas DataFrame:
icecream_sales = np.array([30, 40, 35, 130, 120, 60])
weather_coded = np.array([0, 1, 0, 1, 1, 0])
customers = np.array([2000, 2100, 1500, 8000, 7200, 2000])
df = pd.DataFrame({'icecream_sales': icecream_sales,
                   'weather_code': weather_coded,
                   'customers': customers})

# Define and assign an index (six ends of month starting in April, 2010)
# (details on generating indices are given in a later section;
# pandas 2.2 and later prefer freq='ME' for month-end dates):
ourIndex = pd.date_range(start='04/2010', freq='M', periods=6)
df.set_index(ourIndex, inplace=True)

# Print the DataFrame
print(f'Data Frame: \n{df}\n')

# Access columns by variable names:
subset1 = df[['icecream_sales', 'customers']]
print(f'Subset 1: \n{subset1}\n')

# Access second to fourth row:
subset2 = df[1:4]  # same as df['2010-05-31':'2010-07-31']
print(f'Subset 2: \n{subset2}\n')

# Access rows and columns by index and variable names:
subset3 = df.loc['2010-05-31', 'customers']  # same as df.iloc[1,2]
print(f'Subset 3: \n{subset3}\n')

# Access rows and columns by index and variable integer positions:
subset4 = df.iloc[1:4, 0:2]  # same as df.loc['2010-05-31':'2010-07-31', ['icecream_sales', 'weather_code']]
print(f'Subset 4: \n{subset4}\n')  
```
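
The automatic matching of indices mentioned above can be sketched with two small data frames whose rows are stored in a different order. The month labels and numbers are made up for illustration, and *pd.concat* with *axis=1* is one of several ways to combine data frames side by side.

``` Python
import pandas as pd

# Two data frames with the same index labels in a different order
df_sales = pd.DataFrame({'sales': [30, 40, 35]},
                        index=['Apr', 'May', 'Jun'])
df_customers = pd.DataFrame({'customers': [1500, 2000, 2100]},
                            index=['Jun', 'Apr', 'May'])

# Combine the columns: rows are matched by index label, not by position
combined = pd.concat([df_sales, df_customers], axis=1)
print(f'Combined data frame: \n{combined}\n')

# The June row holds the June values from both data frames
june = combined.loc['Jun']
print(f'June: \n{june}\n')
```

Had the rows been matched by position instead, the June sales would have been paired with the May customers.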

Many economic variables of interest have a qualitative rather than quantitative interpretation. They only take a finite set of values and the outcomes don't necessarily have a numerical meaning. Instead, they represent qualitative information. Examples include gender, academic major, grade, marital status, state, product type or brand. In some of these examples, the order of the outcome has a natural interpretation (such as the grades), in others, it does not (such as state).

As a specific example, suppose we have asked our customers to rate a product on a scale of 0 (= "bad"), 1 (= "okay"), and 2 (= "good"). We have stored the answers of our ten respondents as the numbers 0, 1, and 2 in a list. We could work directly with these numbers, but often it is convenient to use the so-called data type **Categorical**. One advantage is that we can attach labels to the outcomes. We extend a modified version of the previous example, in which the variable *weather* is coded, and demonstrate how to assign meaningful labels. The example also includes some methods from **Table 1.4** below.

**Table 1.4 Important pandas Methods**

|  pandas Methods  |  Description  |
|  :---  |  :---  |
|  df.head()  |  First 5 observations in df  |
|  df.tail()  |  Last 5 observations in df  |
|  df.describe()  |  Descriptive statistics of df  |
|  df.set_index(x)  |  Set the index of df as x  |
|  df['x'] or df.x  |  Access column x in df  |
|  df.iloc[i, j]  |  Access variables and observations in df by integer position  |
|  df.loc[names_i, names_j]  |  Access variables and observations in df by names  |
|  df['x'].shift(i)  |  Create a variable that contains x shifted by *i* rows  |
|  df['x'].diff(i)  |  Create a variable that contains the difference between x and its value *i* rows earlier  |
|  df.groupby('x').function()  |  Apply a function to subgroups of df according to x  |

Here are some examples:

``` Python
import numpy as np
import pandas as pd

# Define a pandas DataFrame:
icecream_sales = np.array([30, 40, 35, 130, 120, 60])
weather_coded = np.array([0, 1, 0, 1, 1, 0])
customers = np.array([2000, 2100, 1500, 8000, 7200, 2000])
df = pd.DataFrame({'icecream_sales': icecream_sales,
                   'weather_code': weather_coded,
                   'customers': customers})

# Define and assign an index (six ends of month starting in April, 2010)
# (details on generating indices are given in a later section;
# pandas 2.2 and later prefer freq='ME' for month-end dates):
ourIndex = pd.date_range(start='04/2010', freq='M', periods=6)
df.set_index(ourIndex, inplace=True)

# Include sales two months ago:
df['icecream_sales_lag2'] = df['icecream_sales'].shift(2)
print(f'Data frame with lag column: \n{df}\n')

# Use a pandas.Categorical object to attach labels (0 = bad; 1 = good):
df['weather'] = pd.Categorical.from_codes(codes = df['weather_code'],
                                          categories = ['bad', 'good'])
print(f'Data frame with label column: \n{df}\n')

# Mean sales for each weather category:
group_means = df.groupby('weather').mean()
print(f'Mean sales for each weather category: \n{group_means}\n')
```
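
Some of the remaining methods from Table 1.4 can be tried on the same ice cream numbers. The sketch below leaves out the date index to keep it short, and the column name *sales_change* is made up for illustration.

``` Python
import numpy as np
import pandas as pd

df = pd.DataFrame({'icecream_sales': np.array([30, 40, 35, 130, 120, 60]),
                   'customers': np.array([2000, 2100, 1500, 8000, 7200, 2000])})

# First three and last two observations
first_rows = df.head(3)
last_rows = df.tail(2)
print(f'First rows: \n{first_rows}\n')
print(f'Last rows: \n{last_rows}\n')

# Descriptive statistics (count, mean, std, min, quartiles, max)
summary = df.describe()
print(f'Summary: \n{summary}\n')

# Change in sales compared with the previous row
# (the first entry is missing because there is no previous row)
df['sales_change'] = df['icecream_sales'].diff(1)
print(f'Data frame with change column: \n{df}\n')
```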

#### Practice:

Use the starter code to create a DataFrame and answer the following questions using pandas functions / methods.

1. What is the score for Thomas Chan?
2. Did Jack Whopper pass the exam?
3. What is Vivian's last name?
4. List the pass_exam status for Joyce, Thomas, Vivian, and Chris.
5. What is the average score for those who passed the exam?

In [None]:
# Define the DataFrame
first_name = np.array(['Jack', 'Joyce', 'Thomas', 'Vivian', 'Chris', 'Eric'])
last_name = np.array(['Whopper', 'Peyton', 'Chan', 'Kama', 'Smith', 'Rosero'])
pass_exam = np.array([True, True, True, False, True, False])
scores = np.array([80, 88, 92, 64, 75, 58])
student_df = pd.DataFrame({'first': first_name,
                           'last': last_name,
                           'pass': pass_exam,
                           'scores': scores})

In [None]:
# What is the score for Thomas Chan?


In [None]:
# Did Jack Whopper pass the exam?


In [None]:
# What is Vivian's last name?


In [None]:
# List the pass_exam status for Joyce, Thomas, Vivian, and Chris.


In [None]:
# What is the average score for those who passed the exam?


### 4. External Data

In previous sections, we entered all of our data manually in the script files. This is a very atypical way of getting data into our machine, and we introduce more useful alternatives in this section. These are based on the fact that many data sets are already stored somewhere else in data formats that Python can handle.

#### Data Sets in the Examples

We will reproduce many of the examples from Wooldridge (2019). The companion web site for the textbook provides the sample data sets in different formats. If you have an access code that came with the textbook, they can be downloaded free of charge. The Stata data sets are also available online in the "Instructional Stata Datasets for Econometrics" collection from Boston College, maintained by Christopher F. Baum.

Fortunately, we do not have to download each data set manually and import it ourselves. Instead, we can use the external module **wooldridge**. It is not part of the Anaconda distribution, so you have to install **wooldridge** as explained in the earlier section. When working with **wooldridge**, the first line of code always is:

``` Python
import wooldridge as woo
```

The data sets from this module are returned as pandas DataFrame objects. Here is an example:

``` Python
# Load data:
wage1 = woo.dataWoo('wage1')

# Check the data type
print(f'wage1 Data Type: \n{type(wage1)}\n')

# Overview of the data set
print(f'Overview of wage1: \n{wage1.head()}\n')
```
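
Data stored in a common format such as CSV can be read in a similar spirit. The sketch below uses the pandas function *read_csv*; to stay self-contained it reads from a text string instead of a real file, and the column names and values are made up for illustration. With an actual file, the path (for example **data/MyData.csv**) would be passed in place of the *io.StringIO* object.

``` Python
import io
import pandas as pd

# Stand-in for the contents of a CSV file (illustrative values only)
csv_text = """date,temp
2020-01-01,62
2020-01-02,67
2020-01-03,58
"""

# read_csv accepts a file path or a file-like object
temps = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

# Check the data type and take a first look at the data
print(f'Data type: \n{type(temps)}\n')
print(f'Overview: \n{temps.head()}\n')
print(f'Column types: \n{temps.dtypes}\n')
```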

