# Tutorial 1

## Intro to using Python for Econometrics

The goals for this lab are to learn several basic uses of statistics in Python, including:

<ol>
<li>Loading a dataset</li>
<li>Creating a Data Frame from the dataset</li>
<li>Interacting with the Data Frame </li>
<li>Generating new columns </li>
<li>Calculating sample statistics</li>
<li>Creating a histogram</li>
</ol>

Before we start to code, we need to make sure that we have all of the Python packages that we will be using
for this lab. You may need to do a little bit of work to get everything set up.

When you are coding yourself, you will often find yourself adding packages as you go. Just make
sure the imports all go at the top!
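If you are not sure whether everything is set up, one way to check is sketched below. It only looks for the four packages used in this lab and imports nothing from them; the variable names `required` and `missing` are just illustrative.

```python
import importlib.util

# Import names of the four packages used in this lab
required = ["pandas", "numpy", "scipy", "matplotlib"]

# find_spec returns None when a top-level package cannot be found
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All packages for this lab are installed.")
```

Anything listed as missing can usually be installed with pip or conda, depending on how your Python was set up.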



In [7]:
import pandas as pd
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt


As you can see, we have imported four things. You can google each of them for more information, but here is a quick overview:

Pandas: Used for "data frames", often shortened to df. You can think of a df as a table or Excel sheet with rows and columns

NumPy: Adds additional mathematical and statistical functions

SciPy: More math and stats. Here we import norm, the normal distribution

Matplotlib: Graphing in Python

# Loading a dataset

Here, I am creating a dataframe df from the data provided in lab1.

You can use the same URL to get the data off of the GitHub repo, or you can use the same command with a local file path.

Remember, we imported pandas as pd and we are trying to load a .dta file. Luckily, pandas has a function for reading Stata files: pd.read_stata.

Pandas also has pd.read_excel, which works the same way for Excel files.

Note that the df on the left side of the equals sign is simply the name of the data frame.

Typing the dataframe's name and running the code prints a preview of the dataframe.

In [8]:
# Use the raw-file URL. The github.com/.../blob/... address is an HTML page, not the .dta file,
# and passing it to read_stata raises "ValueError: Version of given Stata file is 10".
df = pd.read_stata("https://raw.githubusercontent.com/Bauer22/Intro-To-Econometrics-In-Python/main/Datasets/Lab1.dta")
df
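The same command works with a local file path. As a minimal sketch, the example below writes a small made-up dataframe to a temporary .dta file and reads it back. The file name `toy.dta` and the values are illustrative assumptions, not the lab data.

```python
import os
import tempfile

import pandas as pd

# Made-up data for illustration only; this is not the lab dataset
toy = pd.DataFrame({"attend": [32, 16, 8], "final": [29.0, 21.5, 18.0]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy.dta")  # hypothetical local file
    toy.to_stata(path, write_index=False)   # write a Stata file to disk
    loaded = pd.read_stata(path)            # read it back from the local path

print(loaded)
```

For your own work, replace `path` with the location of the .dta file on your computer.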

# Generating a New Column Variable

In a dataframe, the equivalent of generating a new variable in Stata is adding an entire new column.
In pandas, this is referred to as a column or a label.

To create a new label, you simply need to use the syntax dataframe["Label"] and set it equal to something.

In this case, let's create a label that represents the percent of classes a student attended.

After running the cell, scroll to the right in the df output to see the new column.

In [None]:
# Create a new column with label attendenceRate
# df.attend allows us to access the column with the label "attend"
df['attendenceRate'] = (df.attend / 32) * 100
df
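To see the column syntax on its own, here is a small sketch with made-up attendance counts. The dataframe `toy` and its values are assumptions for illustration; the divisor 32 is copied from the cell above.

```python
import pandas as pd

# Made-up attendance counts; 32 is the divisor used in the lab cell
toy = pd.DataFrame({"attend": [32, 16, 8]})

# Bracket syntax creates (or overwrites) a column
toy["attendenceRate"] = (toy["attend"] / 32) * 100

# toy.attend and toy["attend"] select the same column
same = toy.attend.equals(toy["attend"])

print(toy)
print(same)
```

The dot syntax is only a shortcut for reading a column whose label is a valid Python name. To create a new column, use the brackets.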

# Summary Statistics

This is a little different in Python than in Stata. The .describe() function will give you most of
the same information as Stata.

If you need any other information, you can quickly look it up on Google. As you will see for correlation, the
naming conventions are pretty intuitive.

In [None]:
df['attendenceRate'].describe()
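As a quick sketch of what .describe() returns, the example below uses four made-up numbers (the series `scores` is an assumption, not lab data). The result is itself a pandas Series, so single statistics can be pulled out by name.

```python
import pandas as pd

scores = pd.Series([1.0, 2.0, 3.0, 4.0], name="toy_scores")  # made-up values

summary = scores.describe()  # count, mean, std, min, quartiles, max
print(summary)

# Individual statistics are also available as their own methods
print(scores.mean())    # same value as summary["mean"]
print(scores.median())  # same value as summary["50%"]
```

Note that the std reported by pandas is the sample standard deviation, which divides by n - 1.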

So now let's find the correlation. df.corr() will print the correlations between every pair of columns.

We can also calculate a specific correlation by selecting two columns, as seen below.

In [None]:
df.corr()

In [None]:
df['attendenceRate'].corr(df['final'])
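One caveat, shown here as a hedged sketch: in recent pandas versions df.corr() may raise an error if the dataframe contains non-numeric columns. Whether that applies to the lab data depends on its columns. The made-up dataframe `toy` below has a text column, and selecting the numeric columns first is one way around the problem.

```python
import numpy as np
import pandas as pd

# Made-up data with one non-numeric column
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 5, 4, 5],
})

# Keep only the numeric columns before computing the correlation matrix
numeric = toy.select_dtypes(include="number")
corr_matrix = numeric.corr()
print(corr_matrix)

# A single entry of the matrix matches the two-column calculation
r_matrix = corr_matrix.loc["x", "y"]
r_pair = toy["x"].corr(toy["y"])
print(np.isclose(r_matrix, r_pair))
```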

Covariance works the same way, as seen below.

In [None]:

df.cov()

In [None]:
df['attendenceRate'].cov(df['final'])
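To connect the two statistics, the sketch below computes a sample covariance by hand on made-up numbers and then recovers the correlation from it. The series `x` and `y` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up values
x = pd.Series([1, 2, 3, 4, 5], dtype=float)
y = pd.Series([2, 4, 5, 4, 5], dtype=float)

# pandas computes the sample covariance (divides by n - 1)
cov_xy = x.cov(y)

# The same calculation written out by hand
cov_manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# Correlation is the covariance divided by both standard deviations
r_from_cov = cov_xy / (x.std() * y.std())

print(cov_xy, cov_manual, r_from_cov)
```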


# Graphing

## Histogram
For graphing in Python, we will be using matplotlib.

You will probably be able to copy and paste the code from these labs and change the parameters as needed.

I have commented the code below to explain each step.

In [None]:
# Estimate the mean and standard deviation of the values in the label final
mu, std = norm.fit(df['final'])
# Draw the histogram; density=True scales the bars so their total area is 1
plt.hist(df['final'], bins=20, density=True, alpha=0.6, color='g', edgecolor='black')
# Get the bounds of the x axis from the current plot
xmin, xmax = plt.xlim()
# Create 100 evenly spaced x values between the bounds
x = np.linspace(xmin, xmax, 100)
# Evaluate the normal density with the fitted mean and standard deviation at each x
p = norm.pdf(x, mu, std)
# Draw the normal curve on top of the histogram
plt.plot(x, p, 'black', linewidth=2)
# Display the graph
plt.show()
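Two details of the histogram code can be checked without drawing anything. The sketch below uses synthetic data (the array `scores` and its parameters are assumptions, not the lab data) to show what norm.fit returns and what density=True does to the bar heights.

```python
import numpy as np
from scipy.stats import norm

# Synthetic data for illustration only
rng = np.random.default_rng(0)
scores = rng.normal(loc=25.0, scale=5.0, size=500)

# norm.fit returns the maximum likelihood estimates:
# the sample mean and the standard deviation that divides by n (ddof=0)
mu, std = norm.fit(scores)
print(np.isclose(mu, scores.mean()), np.isclose(std, scores.std(ddof=0)))

# With density=True the bar heights are scaled so the total bar area is 1,
# which puts the histogram on the same scale as the normal density curve
heights, edges = np.histogram(scores, bins=20, density=True)
area = np.sum(heights * np.diff(edges))
print(area)
```

Because pandas .std() divides by n - 1 by default, it will report a slightly larger value than the std returned by norm.fit on the same column.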


