## Intro to Python
### In today's lecture, we are going to learn some basic programming in Python and get to know the language better.  By the end of this section, you should:

<ol>
    <li>be able to use Python to create variables</li>
    <li>be able to use Python to do basic mathematical operations</li>
    <li>be able to use Python to load data from a CSV or Excel file</li>
    <li>be able to print out summary statistics</li>
    <li>be able to use the graphing tools to plot data</li>
    <li>be able to use basic code to run a simple regression analysis</li>
</ol>

## Defining and Formatting Your Data
### There are five main types of data you will be dealing with in your programs.

<ol>
    <li>Integers, e.g. 1, 10, 302, -3, -100</li>
    <li>Floating-Point Number, e.g. 1.33, 0.2, 100.99</li>
    <li>Complex Numbers, e.g. 2+5j, 1.5-3j, complex(5, 5.55)</li>
    <li>Strings, e.g. Norman, Hello World, I'm a string</li>
    <li>Boolean, True or False (equivalent to 1 or 0)</li>
</ol>

In [None]:
# Create variable of x that takes a value of 5
x = 5
print(type(x))

In [None]:
# Create variable of y that takes a value of 5.55
y = 5.55
print(type(y))

In [None]:
# Create a complex number z, where z = x + yj (x is the real part, y is the imaginary part)
z = complex(x, y)
print(type(z))

In [None]:
# Create a string "Hello World"
w = "Hello World"
print(type(w))

In [None]:
# Create a boolean (sometimes called a "binary choice" in economics)
s = bool(x==y)
print(s)
print(type(s))

## How do we classify the format of your data in Python?
### There are three major types of data structures
<ul>
    <li>List, an array of data that is "mutable"</li>
    <li>Tuple, an array of data that is "immutable"</li>
    <li>Dictionary, data with keys and values defined</li>
</ul>   

In [None]:
# Create a list []
x = [1, 2, 3, 4, 5]
print(x)

In [None]:
# Retrieve the first value in the list
x[0]

In [None]:
# Check the type of the first value in the array
print(type(x[0]))

In [None]:
# Change the first value in the array to 2
x[0] = 2

In [None]:
# Print the array x again after the change
print(x)

In [None]:
# Check the type of the array
print(type(x))

In [None]:
# Create a tuple ()
y = (1, 2, 3, 4, 5)
print(y)

In [None]:
# Retrieve the first value in the tuple
y[0]

In [None]:
# Check the type of the first value in the array
print(type(y[0]))

In [None]:
# Try to change the first value in the tuple to 2
# This raises a TypeError because tuples are immutable
y[0] = 2

In [None]:
# Check the type of the array
print(type(y))
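
What "immutable" means in practice can be sketched as follows. Assigning to an element of a tuple raises a `TypeError`, so one common workaround is to build a new tuple. The names `original` and `updated` are illustrative and not part of the lecture code.

```python
# Tuples cannot be changed in place; catching the error makes that visible
original = (1, 2, 3, 4, 5)

try:
    original[0] = 2
    changed_in_place = True
except TypeError as err:
    changed_in_place = False
    print(f"TypeError: {err}")

# Build a new tuple instead: the new first value plus the rest of the old tuple
updated = (2,) + original[1:]
print(original)
print(updated)
```

The original tuple is left untouched, which is why tuples are a reasonable choice for data that should not change during a program.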

In [None]:
# Create a dictionary {"key": "values"}
z = {"name": ["Norman", "Eric", "Winson"],
    "age": [37, 40, 36],
    "income": [100, 500, 10]}
print(z)

In [None]:
# Extracting the data from a dictionary

# Printing all the names
print(z["name"])

# Printing the age for Norman
print(z["age"][0])

# Printing the income for Winson
print(z["income"][2])


In [None]:
# Check the type of dictionary z
print(type(z))

#### It is worth noting that indexing in Python and many other programming languages starts at 0.  In R, however, indexing starts at 1, and the same is true of many objects in VBA (the Excel programming language).
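
A short sketch of zero-based indexing, using a made-up list called `letters` (not part of the lecture data). It also shows negative indices and slices, which follow the same counting rule.

```python
# Hypothetical list used only for this illustration
letters = ["a", "b", "c", "d", "e"]

first = letters[0]        # index 0 is the first element
last = letters[-1]        # negative indices count from the end
middle = letters[1:3]     # slices include the start and exclude the stop

print(first, last, middle)
print(len(letters))       # 5 elements, so valid indices are 0 through 4
```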

In [None]:
# Looping through an array of data
for i in x:
    print(i)

In [None]:
# Looping through an array of data and add 10 to each value
for i in x:
    print(i + 10)

In [None]:
# Looping through dictionary z (this iterates over the keys)
for i in z:
    print(i)

In [None]:
# Looping through the key "name" in dictionary z
for i in z["name"]:
    print(i)
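
Looping over a dictionary yields only its keys. If the keys and values are both needed, or if the lists should be read one person at a time, the built-in `.items()` and `zip()` can help. The sketch below copies the dictionary from the cells above into a new variable called `people` (an illustrative name).

```python
people = {"name": ["Norman", "Eric", "Winson"],
          "age": [37, 40, 36],
          "income": [100, 500, 10]}

# .items() yields each key together with its list of values
for key, values in people.items():
    print(key, values)

# zip() walks the three lists in parallel, one person at a time
records = list(zip(people["name"], people["age"], people["income"]))
for name, age, income in records:
    print(f"{name} is {age} years old and has an income of {income}")
```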

## Loading Data from the Existing Data File (CSV & Excel)
### There are different ways to open a CSV or Excel file in Python.  One of the most popular libraries in the community is Pandas.  Let's take a quick look at how to read a CSV or Excel file with Pandas.
### A "library" is a collection of prewritten modules, often with performance-critical parts implemented in a lower-level language such as C or C++, that provides ready-made functionality so Python users do not need to code everything themselves.

In [None]:
# If the Pandas library is not installed on your computer yet, you will need to pip install it into your Anaconda environment
# !pip install pandas

In [None]:
# If Pandas is already installed, you can import the library from your Anaconda environment directly
import pandas as pd

In [None]:
# Create a path to the CSV file
file_path = "data/donors2008.csv"

In [None]:
# Read the data from the CSV file into a Pandas data frame
# The "encoding" parameter is for situations where the CSV file contains other languages or unusual characters.
# Some standard Python encodings include "ISO-8859-1" (Western Europe), "utf_8" (all languages), "big5" (Traditional Chinese), etc.
data_df = pd.read_csv(file_path, encoding="ISO-8859-1")
data_df
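
Why the encoding matters can be sketched with a tiny in-memory CSV instead of a file on disk. The data below is made up, and `io.BytesIO` simply stands in for a real file; the point is that the bytes must be decoded with the same encoding that was used to write them.

```python
import io
import pandas as pd

# Made-up two-row CSV containing accented characters, encoded as ISO-8859-1
raw_bytes = "name,city\nJosé,Málaga\nZoë,Köln\n".encode("ISO-8859-1")

# Reading with the matching encoding recovers the original text
decoded_df = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1")
print(decoded_df)

# Decoding the same bytes as UTF-8 fails because they are not valid UTF-8
try:
    raw_bytes.decode("utf_8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(f"UTF-8 decoding failed: {utf8_failed}")
```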

In [None]:
# Create a new path to the Excel file
file_path = "data/crime_data.xlsx"

In [None]:
# Read the Excel file into a Pandas data frame (unlike read_csv, read_excel takes no "encoding" argument)
data_df = pd.read_excel(file_path)
data_df.head()

In [None]:
# You can print a selected column of the data frame
print(data_df["year"])

In [None]:
# You can also print specific rows of the data frame: [0:1] selects from row 0 up to, but not including, row 1
print(data_df[0:1])
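
A few common ways to select parts of a data frame, sketched on a small made-up data frame. The column names `year` and `count` and the variable `toy_df` are assumptions for illustration and do not come from the course files.

```python
import pandas as pd

# Made-up data frame; the column names are illustrative only
toy_df = pd.DataFrame({"year": [2006, 2007, 2008, 2009],
                       "count": [12, 15, 9, 20]})

year_column = toy_df["year"]              # a single column, returned as a Series
first_two_rows = toy_df[0:2]              # rows 0 and 1; the stop position 2 is excluded
second_row = toy_df.iloc[1]               # one row by integer position
filtered = toy_df[toy_df["count"] > 10]   # rows that satisfy a condition

print(first_two_rows)
print(filtered)
```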

## Mathematical Operations
### Here is a list of mathematical operations you can use in Python
<ol>
    <li>Addition/Subtraction:  x + y  and  x - y</li>
    <li>Multiplication/Division:  x * y  and  x / y</li>
    <li>Exponent/Power:  x**y</li>
    <li>Modulus:  x%y, read as x mod y</li>
</ol>
   

In [None]:
x = 3
y = 10

In [None]:
x + y

In [None]:
x - y

In [None]:
x * y

In [None]:
x / y

In [None]:
x**y

In [None]:
x%y

In [None]:
y%x
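
How division and modulus relate can be sketched with the same values of `x` and `y`. Floor division (`//`) and `divmod()` are not in the list above; they are added here only because they make the meaning of the modulus easier to see.

```python
x = 3
y = 10

true_division = y / x          # always returns a float
floor_division = y // x        # rounds down to the nearest whole number
remainder = y % x              # what is left over after floor division
quotient, rest = divmod(y, x)  # both results in one call

print(true_division, floor_division, remainder)

# The pieces fit back together: y == x * (y // x) + (y % x)
print(x * floor_division + remainder == y)
```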

## Statistics in Python
### In this section, we are going to go over two fundamental libraries for scientific applications.
<ul>
    <li>Numpy: A high-performance multidimensional array object, and tools for working with these arrays.</li>
    <li>Scipy: A collection of scientific computing routines built on top of Numpy arrays, including statistics, optimization, and linear algebra.</li>
</ul>

### Note that in this section, we are going to compare the results with those from Stata and R

In [None]:
# Installing the dependency
# !pip install numpy
import numpy as np

In [None]:
# Create a numpy array (a rank 1 array) and check its type
d = np.array([10, 20, 30])
print(type(d))

In [None]:
# Extract value in the first position of the array
print(d[0])

In [None]:
# Create a 3x3 matrix (a rank 2 array) with numpy
A = np.array([[1, 2, 3], [-4, -5, -6], [7, -8, 9]])
print(type(A))

In [None]:
# Print the 3x3 matrix
print(A)

In [None]:
# Extract a value from the matrix as if it were a rank 1 array
print(A[0])

# Note that a single index on a rank 2 array returns the whole first row, not a single value

In [None]:
# Extract the a11 position from the 3x3 matrix A
print(A[0,0])

In [None]:
# Find the determinant for Matrix A
det = np.linalg.det(A)
print(det)

In [None]:
# Solve a system of equations as we learned in class
# A is the coefficient matrix, d is the vector of constants, and x is the vector of variables
x = np.linalg.solve(A, d)

In [None]:
# Print the variable matrix x
print(x)

In [None]:
# Check the solution
np.allclose(np.dot(A, x), d)
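
The same steps on a smaller, made-up system can be easier to check by hand. The sketch below assumes the two equations 2x + y = 5 and x + 3y = 10; the names `coeffs`, `consts`, and `solution` are illustrative.

```python
import numpy as np

# Made-up system:  2x + y = 5  and  x + 3y = 10
coeffs = np.array([[2.0, 1.0],
                   [1.0, 3.0]])
consts = np.array([5.0, 10.0])

# A unique solution exists only when the determinant is non-zero
det = np.linalg.det(coeffs)
solution = np.linalg.solve(coeffs, consts)

print(det)
print(solution)

# Multiplying back should reproduce the constants (up to rounding error)
print(np.allclose(coeffs @ solution, consts))
```

Substituting x = 1 and y = 3 into both equations confirms the result by hand.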

In [None]:
# We will not go deeper into matrix calculations with Numpy here.
# Instead, let's see how to get descriptive statistics for a Numpy array:
# the min, max, range, percentile, median, mean, variance, and standard deviation.

# Use the Numpy function random.randint to generate a new array of 10 integers between 0 and 100 (inclusive)
w = np.random.randint(101, size=10)

# Minimum value in array w
print(np.amin(w))

# Maximum value in array w
print(np.amax(w))

# Range of array w (maximum value - minimum value)
print(np.ptp(w))

# 90th percentile of array w
print(np.percentile(w, 90))

# Median of array w
print(np.median(w))

# Mean / average of array w
print(np.average(w))

# Note that the Numpy library does not have a mode function; we will get the mode with Scipy

# Variance of array w
print(np.var(w))

# Standard deviation of array w
print(np.std(w))
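
One detail matters when comparing these numbers with Stata or R: `np.var` and `np.std` divide by n by default (the population formulas), while R's `var()` and Stata's `summarize` use n - 1 (the sample formulas). The sketch below uses a made-up array to show how the `ddof` argument switches between the two.

```python
import numpy as np

# Made-up sample of eight observations
sample = np.array([2, 4, 4, 4, 5, 5, 7, 9])

pop_var = np.var(sample)             # divides by n (ddof=0, the NumPy default)
sample_var = np.var(sample, ddof=1)  # divides by n - 1
pop_std = np.std(sample)
sample_std = np.std(sample, ddof=1)

print(pop_var, sample_var)
print(pop_std, sample_std)
```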


In [None]:
# To get the mode of an array or a descriptive statistics summary of an array,
# we need another extremely useful library called "Scipy"

from scipy import stats

# Mode of array w
stats.mode(w)

In [None]:
# To get a statistical summary of array w, unpack the result of stats.describe
# (min_val and max_val are used so the built-in min() and max() are not overwritten)
(n, (min_val, max_val), mean, var, skew, kurt) = stats.describe(w)

In [None]:
print("Statistical Summary:")
print("-----------------------------")
print(f"Number of Observations: {n}")
print(f"Minimum: {min_val}")
print(f"Maximum: {max_val}")
print(f"Mean: {mean}")
print(f"Variance: {var}")
print(f"Skewness: {skew}")
print(f"Kurtosis: {kurt}")
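
As an alternative to tuple unpacking, the result of `stats.describe` can be kept as a single object and its fields read by name. This is a minimal sketch on made-up data; `values` and `summary` are illustrative names.

```python
import numpy as np
from scipy import stats

values = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # made-up data
summary = stats.describe(values)

# The result is a named tuple, so fields can be read by name
print(summary.nobs)
print(summary.minmax)
print(summary.mean)
print(summary.variance)   # uses n - 1 in the denominator by default
```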


In [None]:
# Read in data
general_heights = pd.read_csv("./data/general_heights.csv")
wba_heights = pd.read_csv("./data/wba_data.csv")

In [None]:
general_heights.head()

In [None]:
wba_heights.head()

In [None]:
# Using Scipy to get summary statistics on "height" from the general height data set.
# (min_val and max_val are used so the built-in min() and max() are not overwritten)

(n, (min_val, max_val), mean, var, skew, kurt) = stats.describe(general_heights["height"])

In [None]:
print("Statistical Summary:")
print("-----------------------------")
print(f"Number of Observations: {n}")
print(f"Minimum: {min_val}")
print(f"Maximum: {max_val}")
print(f"Mean: {mean}")
print(f"Variance: {var}")
print(f"Skewness: {skew}")
print(f"Kurtosis: {kurt}")

In [None]:
# Using Scipy to get a summary statistics on "height" from wba data set.



In [None]:
# Print the statistic summary



## Simple Statistical Plots

### Python has two popular plotting libraries.

* Matplotlib, [Matplotlib Examples](https://matplotlib.org/gallery/index.html)
* Plotly, [Plotly Examples](https://plot.ly/python/)

In [None]:
# Dependencies
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

# A common plot in statistics is the frequency distribution (histogram).
# Let's use Matplotlib to plot the distribution of heights in the general height data set.

plt.hist(general_heights["height"])

# Defining the axes, title, limit of axes, and grid
# plt.xlabel("Heights")
# plt.ylabel("Frequency")
# plt.title("Frequency Distribution (General Heights)")
# plt.xlim(50, 85)
# plt.ylim(0, 110)
# plt.grid(True)
plt.show()
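
A histogram sorts the data into bins and counts how many observations land in each one. The counting step can be sketched with `np.histogram`, which Matplotlib's `hist` relies on for binning. The heights and the bin settings below are made up for illustration and are not the course data.

```python
import numpy as np

# Made-up heights in inches, not the course data set
heights = np.array([61, 63, 64, 64, 66, 67, 67, 68, 70, 74])

# Four equal-width bins between 60 and 76: [60,64), [64,68), [68,72), [72,76]
counts, bin_edges = np.histogram(heights, bins=4, range=(60, 76))

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.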

In [None]:
# Let's try to use Matplotlib to plot the sample distribution for the wba data set



In [None]:
# Another common plot is the scatter plot for two or three variables

# Read in data
health = pd.read_csv("./data/food_env_data.csv")

health.head()

In [None]:
# For convenience, we store the two columns "Percent Diabetes" and "Percent Obesity" in two separate variables

diabetes = health["Percent Diabetes"]
obesity = health["Percent Obesity"]

In [None]:
# We are going to use a scatter plot to visualize the two variables in one plot

plt.scatter(obesity, diabetes, color="red", marker='o')

# Defining the axes, title, limit of axes, and grid
# plt.xlabel("Percent of Obesity (County Level)")
# plt.ylabel("Percent of Diabetes (County Level)")
# plt.title("Diabetes vs. Obesity (County Level)")
# plt.xlim(10, 50)
# plt.ylim(2, 25)
# plt.grid(True)
plt.show()

## Simple Statistical Functions and Regression Modeling
### In this section, we are going to go through the syntax for some very basic statistical tests and regression models.  You will spend the majority of your time in the M.S. program learning different regression models, so I am not going to cover this material in depth.
* t-test, with Scipy [Documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html)
* F-test, with Scipy [Documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html)
* Regression Model, with Scipy [Documentation](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html)

In [None]:
# Let's use our general height and wba height example
# Keep in mind that the default is a two-sided test
# This example uses an independent-samples design, hence "stats.ttest_ind"
(t_stat, p) = stats.ttest_ind(general_heights["height"], wba_heights["height"])

In [None]:
print(f'General Heights Average: {general_heights["height"].mean()}')
print(f'WBA Heights Average: {wba_heights["height"].mean()}')
print(f't_statistic: {t_stat}')
print(f'Two-Tailed P-Value: {p}')
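
Where the t-statistic comes from can be sketched by computing it by hand and comparing it with Scipy. By default, `stats.ttest_ind` assumes equal variances in the two groups, so the manual version uses the pooled variance formula. The two samples below are made up and are not the course data.

```python
import numpy as np
from scipy import stats

# Made-up samples, not the course data
group_a = np.array([64.0, 66.0, 67.0, 69.0, 70.0, 71.0])
group_b = np.array([70.0, 72.0, 73.0, 75.0, 76.0, 78.0])

n_a, n_b = len(group_a), len(group_b)
var_a = np.var(group_a, ddof=1)
var_b = np.var(group_b, ddof=1)

# Pooled variance: a weighted average of the two sample variances
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t_scipy, p_value = stats.ttest_ind(group_a, group_b)
print(t_manual, t_scipy, p_value)
```

If the equal-variance assumption is doubtful, `stats.ttest_ind` also accepts `equal_var=False`, which runs Welch's t-test instead.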


In [None]:
# Another common statistical test is the F-test
# One-way ANOVA (Analysis of Variance) uses an F-test to compare the means of several groups.
x = [1, 2, 8, 10, 3, 5, 13, 12, 6, 8]
y = [4, 9, 2, 15, 10, 9, 9, 5, 12, 14]
z = [12, 15, 10, 5, 4, 7, 11, 11, 3, 2]

(F_stat, p) = stats.f_oneway(x, y, z)

In [None]:
print(f'x Average: {np.average(x)}')
print(f'y Average: {np.average(y)}')
print(f'z Average: {np.average(z)}')
print(f'F_statistic: {F_stat}')
print(f'P-Value: {p}')
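
The F-statistic is the ratio of the variation between group means to the variation within groups. The sketch below recomputes it by hand for the same three lists and compares the result with `stats.f_oneway`; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

groups = [np.array([1, 2, 8, 10, 3, 5, 13, 12, 6, 8]),
          np.array([4, 9, 2, 15, 10, 9, 9, 5, 12, 14]),
          np.array([12, 15, 10, 5, 4, 7, 11, 11, 3, 2])]

k = len(groups)                              # number of groups
n_total = sum(len(g) for g in groups)        # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_value = stats.f_oneway(*groups)
print(f_manual, f_scipy, p_value)
```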

In [None]:
# Finally, let's see how to fit a simple regression model in Python with Scipy,
# using the diabetes and obesity example.

# Economists often use a correlation matrix to look for correlation between variables
np.corrcoef(diabetes, obesity)

In [None]:
# Using Scipy, we can also build a simple regression model with a single explanatory variable
m_slope, m_int, m_r, m_p, m_std_err = stats.linregress(obesity, diabetes)
m_fit = m_slope * obesity + m_int


In [None]:
print(m_slope)
print(m_int)
print(m_p)
print(m_std_err)
print(m_r)
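
For a single explanatory variable, the least-squares slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the two means. The sketch below checks these formulas against `stats.linregress` on made-up numbers; `x_vals` and `y_vals` are illustrative and not the county data.

```python
import numpy as np
from scipy import stats

# Made-up data for illustration only
x_vals = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0])
y_vals = np.array([4.1, 5.9, 8.2, 9.8, 12.1, 13.9])

# Least-squares formulas for one explanatory variable
slope = np.cov(x_vals, y_vals, ddof=1)[0, 1] / np.var(x_vals, ddof=1)
intercept = y_vals.mean() - slope * x_vals.mean()

result = stats.linregress(x_vals, y_vals)
print(slope, intercept)
print(result.slope, result.intercept, result.rvalue ** 2)
```

Squaring `rvalue` gives the R-squared of the fitted line.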

In [None]:
# Using Matplotlib to visualize the result

plt.scatter(obesity, diabetes, color="green", marker='o')
plt.plot(obesity, m_fit, "b--", linewidth=2)

# Defining the axes, title, limit of axes, and grid
# plt.xlabel("Percent of Obesity (County Level)")
# plt.ylabel("Percent of Diabetes (County Level)")
# plt.title("Diabetes vs. Obesity (County Level)")
# plt.xlim(10, 50)
# plt.ylim(2, 25)
# plt.grid(True)
plt.show()

## Cleaning and Manipulating the Data
### Economists and data scientists are often said to spend about 80% of their time cleaning and manipulating data, with the analysis itself taking only about 20%.  It is important to learn at least the basics.

In [None]:
# Drop any row that has at least one N/A or empty data
health.dropna()

# Drop the columns where at least one element is missing
health.dropna(axis="columns")

# Drop the rows where all elements are missing
health.dropna(how="all")
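
The three `dropna` variants behave quite differently, which is easier to see on a small made-up data frame. The sketch below also shows that `dropna` returns a new data frame and leaves the original unchanged unless the result is assigned back. The name `toy` and its columns are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up data frame with missing values
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan],
                    "b": [4.0, 5.0, np.nan, np.nan],
                    "c": [7.0, 8.0, 9.0, np.nan]})

rows_any = toy.dropna()                 # keeps only rows with no missing value
rows_all = toy.dropna(how="all")        # drops only the last row, which is entirely missing
cols_any = toy.dropna(axis="columns")   # drops every column that has a missing value

print(len(toy), len(rows_any), len(rows_all), cols_any.shape)

# dropna() returns a new data frame; assign the result to keep it
toy = toy.dropna()
```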



In [None]:
# Adding/Subtracting two columns of data
health["Total Survey"] = health["Survey Diabetes"] + health["Survey Obesity"]

# Multiplying/Dividing two columns of data
health["Total Diabetes"] = health["Survey Diabetes"] * health["Percent Diabetes"] / 100

# Taking the natural log of the column
health["Log Percent Obesity"] = np.log(health["Percent Obesity"])

# Exponentiating the column
health["Exp Log Percent Obesity"] = np.exp(health["Log Percent Obesity"])
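
A quick way to confirm that the log and exponential transformations undo each other is a round-trip check. The sketch below uses a small made-up data frame (`toy`) rather than the course data.

```python
import numpy as np
import pandas as pd

# Made-up percentages for illustration
toy = pd.DataFrame({"Percent Obesity": [22.5, 30.1, 27.8]})

toy["Log Percent Obesity"] = np.log(toy["Percent Obesity"])          # natural log
toy["Exp Log Percent Obesity"] = np.exp(toy["Log Percent Obesity"])  # inverse of the log

print(toy)
print(np.allclose(toy["Percent Obesity"], toy["Exp Log Percent Obesity"]))
```

Note that `np.log` is only defined for positive values, so columns containing zeros or negative numbers need to be handled before taking logs.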