## Introduction to Python

This is the first notebook of the course. It will be used to discuss introductory aspects of Python and Jupyter Notebook.

In [None]:
# This cell contains some Python instructions
# These first lines are just comments and are not interpreted by Python
# To run the content of a cell you just need to select the cell and click the "play" icon on top


# Let's use a simple print statement: 

print("Hello Gent!")

As you can see, once the source code in a given cell is executed, the corresponding output is displayed right below the cell.  

### Python Packages

As in **R**, where users load packages to carry out specialized operations for the task at hand, Python offers a large collection of packages that can be loaded as needed. Common packages that you will see over and over include **NumPy (“Numerical Python”)**, **pandas (“Python Data Analysis Library”, with a name derived from "panel data")**, **Matplotlib (a MATLAB-like plotting library)**, **SciPy ("Scientific Python", a scientific computation library built on NumPy)**, and many others. The good news is that the Anaconda distribution comes “pre-packaged” with many of these packages. A package is made available with the `import` statement.
For example, let’s import numpy:

In [None]:
import numpy

While the above will successfully import **numpy**, when we use numpy we need to prefix every function we call with the package name. For example, suppose we wished to use numpy to sum a set of values for a variable. We will first give Python some data to work with and construct the *variable (or vector) x*:

In [None]:
x = numpy.array([0, 1, 2, 3, 4, 5])
print(x)

In [None]:
# Sum of the vector
numpy.sum(x)

We see that Python has successfully added the numbers in the vector using numpy.sum(). However, instead of having to write out “numpy” each time, wouldn’t it be easier to simply abbreviate it to a shorter label? This is exactly what most programmers do. That is, instead of importing numpy as itself, it is instead imported as **“np”**:

In [1]:
import numpy as np
x = np.array([0, 1, 2, 3, 4, 5])
np.sum(x)

15

In [2]:
import numpy as iusenumpyatgent
iusenumpyatgent.sum(x)

15

In [None]:
import numpy as np

N.B. A suggestion: for all Python packages, there are far too many functions to summarize here. Doing good
data analysis means having a book or library (or **Internet connection**) available to you
at all times so you can dig up code and functions as you require them. 

**Memorizing
code, or attempting to learn all code, is a waste of your time and will be a goal
forever unrealized. Learning how to *problem-solve* given novel data situations
is the skill you actually need to learn.**

Have you ever noticed that your
mechanic does not always have the immediate answers, but knows where to look?
Likewise, your veterinarian may not immediately know the problem with your dog,
but a good veterinarian will know the path to pursue to finding the answer.

Understanding science and associated statistics
is much more difficult to “dig up” as your knowledge requirements increase.

In [None]:
# Create an array of data with np.array():
x = np.array([1, 5, 8, 2, 7, 4])
x

In [None]:
# Suppose you then want to know the value of the 4th element of the array.
# Python indexing starts at 0, so the 4th element is at index 3.
x[3]
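Python counts positions from zero, so index `i` refers to the (i+1)-th element. A minimal sketch with the same array; the variable names are only illustrative:

```python
import numpy as np

x = np.array([1, 5, 8, 2, 7, 4])

first = x[0]          # 1st element
fourth = x[3]         # 4th element
last = x[-1]          # negative indices count from the end
first_three = x[:3]   # slice: start included, stop excluded

print(first, fourth, last, first_three)
```

Slices follow the same convention: `x[:3]` returns the elements at positions 0, 1 and 2.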

In [None]:
# Join or concatenate arrays in numpy
y = np.array([4, 6, 8, 2, 4, 1])
y

In [None]:
np.concatenate([x, y])

In [3]:
np.concatenate([x, y], axis = 0)

NameError: name 'y' is not defined

https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html

### Installing a New Package in Python

In [4]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


Since pandas comes with Anaconda already, the system reports it as
“already satisfied.”

In [12]:
# We can also import pandas now
import pandas as pd

#### Series
Series is a **one-dimensional labeled array (a vector)** capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

> s = pd.Series(data, index=index)

where __s__ is the Series we want to create from the data and __index__ is the list of labels we want to assign to the samples/objects.

The data can be many different things, for example:
 - a Python dict (dictionary)
 - an ndarray (n-dimensional array)
 - a scalar value (like 5)

In [13]:
# in case of ndarray
vec = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
vec

a   -0.692636
b   -1.683630
c   -0.229301
d   -1.617286
e   -0.230238
dtype: float64

In [8]:
vec.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [9]:
vec.values

array([-0.20814864, -0.40071386, -0.7881478 ,  0.79787289, -1.28945044])

In [14]:
# in case of Python dictionary (dict)
mydict = {"b": 1, "a": 0, "c": 2}

In [37]:
vec = pd.Series(mydict)
vec

b    1
a    0
c    2
dtype: int64

In [38]:
# If an index is passed, the values in data corresponding to the labels in the index will be pulled out.
pd.Series(vec, index=["b", "c", "d", "a"])

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

**NaN (not a number)** is the standard missing data marker used in pandas.
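NaN is a floating-point value, which is why the integer Series above was converted to `float64` once a missing label was requested. The sketch below rebuilds that situation with `Series.reindex` (used here instead of the `pd.Series(vec, index=...)` call above, as an assumption that both give the same result) and shows how missing entries are usually detected:

```python
import numpy as np
import pandas as pd

s = pd.Series({"b": 1, "a": 0, "c": 2})

# Asking for a label that does not exist ("d") introduces NaN
r = s.reindex(["b", "c", "d", "a"])

print(r.dtype)            # integers were upcast to float64 to hold NaN
print(np.nan == np.nan)   # NaN never compares equal, not even to itself
print(r.isna())           # so use isna() / notna() to detect missing values
print(r.isna().sum())     # number of missing entries
```

Because `==` cannot find NaN, `isna()` is the safer way to build a mask of missing values.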

In [43]:
# you can find the value of your sample like:
vec["c"]

2

In [44]:
# you can do operations on series (as well as NumPy arrays)
vec + vec

b    2
a    0
c    4
dtype: int64

In [45]:
vec*3

b    3
a    0
c    6
dtype: int64

In [46]:
np.exp(vec)

b    2.718282
a    1.000000
c    7.389056
dtype: float64

In [28]:
# you can assign a name to the vector too
vec = pd.Series(np.random.randn(5), name="my_nice_vector")

In [34]:
vec

0    0.176361
1    0.820908
2    0.816134
3    0.630828
4   -0.548989
Name: my_nice_vector, dtype: float64

#### DataFrame
DataFrame is a **2-dimensional labeled data structure (a matrix)** with *columns of potentially different types*. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

 - Dict of 1D ndarrays, lists, dicts, or Series

 - 2-D numpy.ndarray

 - Structured or record ndarray

 - A Series

 - Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame.

In [48]:
# create your first DataFrame
data = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}


In [31]:
data

{'one': a    1.0
 b    2.0
 c    3.0
 dtype: float64,
 'two': a    1.0
 b    2.0
 c    3.0
 d    4.0
 dtype: float64}

In [49]:
df = pd.DataFrame(data)

In [50]:
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


Nice, right?!
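The optional `index` and `columns` arguments described earlier can be sketched as follows. The data and labels are made up for illustration only:

```python
import numpy as np
import pandas as pd

# 2-D ndarray input: row and column labels are supplied explicitly
arr = np.arange(6).reshape(3, 2)
df_arr = pd.DataFrame(arr, index=["a", "b", "c"], columns=["one", "two"])

# Dict-of-lists input: `columns` selects and orders the keys;
# a requested column that is absent from the dict ("three") is filled with NaN
d = {"one": [1.0, 2.0], "two": [3.0, 4.0]}
df_dict = pd.DataFrame(d, index=["r1", "r2"], columns=["two", "one", "three"])

print(df_arr)
print(df_dict)
```

Passing `columns` with a dict therefore acts as a selection, not as a renaming of the existing keys.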

In [51]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [52]:
df.columns

Index(['one', 'two'], dtype='object')

In [53]:
# we can slice them by row
pd.DataFrame(df, index=["d", "b", "a"])

   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0


In [54]:
# we can slice them also by column
pd.DataFrame(df, index=["d", "b", "a"], columns=["two", "three"])

   two  three
d  4.0    NaN
b  2.0    NaN
a  1.0    NaN


In [55]:
# we can make operations
df["three"] = df["one"] * df["two"]
df

   one  two  three
a  1.0  1.0    1.0
b  2.0  2.0    4.0
c  3.0  3.0    9.0
d  NaN  4.0    NaN


In [56]:
# we can remove columns
del df["three"]
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


### Import a file

In [None]:
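# In an ordinary string a backslash starts an escape sequence ("\U" is a syntax error),
# so a Windows path needs a raw string (r"...") or forward slashes as in the next cell
filename = r"C:\Users\Eugenio_Py\Desktop\Notebooks\datasets\iris.csv"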

In [None]:
filename = "C:/Users/Eugenio_Py/Desktop/Notebooks/datasets/iris.csv"

In [None]:
# import a .csv file
df = pd.read_csv(filename, sep = ";", header = 0, index_col = 0)

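 - _sep_ = the separator (delimiter) between fields, here a semicolon
 - _header_ = the number of the row that holds the column names (0 = first row)
 - _index_col_ = the number of the column that holds the sample names (0 = first column)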
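To see these three parameters at work without needing the file on disk, the sketch below reads a small, hypothetical semicolon-separated text from memory. The sample names and values are invented for the example:

```python
import io
import pandas as pd

# Hypothetical file content, standing in for a semicolon-separated .csv on disk
csv_text = """sample;sepal.length;variety
s_1;5.1;Setosa
s_2;7.0;Versicolor
"""

demo = pd.read_csv(io.StringIO(csv_text), sep=";", header=0, index_col=0)

print(demo)
print(demo.index.tolist())    # row labels taken from the first column
print(demo.columns.tolist())  # column names taken from the first row
```

If `sep` did not match the file, pandas would read each line as a single field, so a one-column result is usually a sign of a wrong separator.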

In [None]:
df

In [None]:
# get some information on the dataset
df.info()

 - **float64** is a *double-precision floating-point format*, a computer number format occupying 64 bits in computer memory;
 - **object** is the generic dtype pandas uses for columns that hold arbitrary Python objects, most commonly text strings.

In [None]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

In [None]:
df2.dtypes

In [None]:
# Convert a variable to categorical
df.variety = df.variety.astype('category')

In [None]:
df.variety

In [None]:
# view the first 5 rows (the default for head())
df.head()

In [None]:
# view the last 3 rows
df.tail(3)

In [None]:
print(df)

In [None]:
# view the dimensions
df.shape

In [None]:
# convert to numpy array
df.to_numpy()

In [None]:
# transpose the data
df.T

In [None]:
# obtain a quick statistical summary of the data (only on the numerical data)
df.describe()

In [None]:
# sorting
df.sort_values(by="sepal.length", ascending=True)

In [None]:
# selecting a single column (returns a Series)
df["sepal.length"]

In [None]:
df.variety

In [None]:
df[0:2]

In [None]:
df["s_1":"s_10"]

In [None]:
# selection by label (loc)
df.loc["s_1"]

In [None]:
# select rows and columns
df.loc["s_1":"s_10",["sepal.length","variety"]]

In [None]:
# select by position (iloc)
df.iloc[3:5,:]

In [None]:
df.iloc[2:4,0:2]

In [None]:
df.iloc[[1, 2, 4], [0, 2]]

In [None]:
# Filtering
df[df["sepal.length"] > 5.0]

### Missing data

If the dataset contains missing values, we need to handle them before doing any statistical analysis.

There are mainly three types of missing values:

 - __MCAR (Missing completely at random)__: the missingness does not depend on any other features --> mean value
 - __MAR (Missing at random)__: the missingness depends on other observed features, but not on the missing value itself
 - __MNAR (Missing not at random)__: there is a reason why these values are missing, e.g. they fall below the limit of detection --> (a random number below 0.5 * limit of detection)

In [None]:
# dictionary of lists (named data_dict so that the built-in dict is not shadowed)
data_dict = {'A': [100, 90, np.nan, 95],
             'B': [30, 45, 56, np.nan],
             'C': [np.nan, 40, 80, 98]}

# creating a DataFrame from the dictionary
df1 = pd.DataFrame(data_dict)

In [None]:
df1

In [None]:
#To drop any rows that have missing data:
df1.dropna(how="any")

In [None]:
# Filling missing data with a random number below 0.5 * limit of detection
# (a single random value is drawn here and used for every missing cell)
import random
lod = 6
df1.fillna(value=random.uniform(0.1*lod, 0.5*lod))
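Passing one number to `fillna` fills every gap with that same number. If an independent draw per missing cell is preferred, one possible sketch is the following. It assumes the same illustrative `lod = 6` and the same lower bound of 0.1 * lod; the names `df_demo`, `draws` and `filled` are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
lod = 6                         # assumed limit of detection, as in the cell above

df_demo = pd.DataFrame({"A": [100, 90, np.nan, 95],
                        "B": [30, 45, 56, np.nan],
                        "C": [np.nan, 40, 80, 98]})

# One independent draw per cell, with the same shape and labels as the data
draws = pd.DataFrame(rng.uniform(0.1 * lod, 0.5 * lod, size=df_demo.shape),
                     index=df_demo.index, columns=df_demo.columns)

# Keep observed values; take the random draw only where a value is missing
filled = df_demo.where(df_demo.notna(), draws)
print(filled)
```

`DataFrame.where` keeps the original entry wherever the condition is true and takes the value from `draws` elsewhere, so observed data are never altered.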

In [None]:
# Filling missing data with the mean of each column
df1.mean()

In [None]:
df1.fillna(df1.mean())

In [None]:
#To get the boolean mask where values are nan:
pd.isna(df1)

### Central limit theorem (CLT)
Starting from a **binomially-distributed population**.

The following represents our binomially distributed population (recall that the binomial is a discrete distribution, so we show its probability mass function below):
![image-3.png](attachment:image-3.png)

 - k = number of successes;
 - p = success probability of each trial;
 - x = number of trials
 
k successes occur with probability p^k and (x−k) failures occur with probability (1−p)^(x−k).
 
We plot the sampling distribution obtained with a large sample size (**n = 500**) for a Binomially-distributed variable with **x = 30** trials and success probability **p = 0.9**.

For the plots, we will use the **seaborn** library 

In [None]:
import seaborn as sns

In [None]:
sample_size=500

df500 = pd.DataFrame()

# draw 50 samples, each with sample_size observations
for i in range(1, 51):
    binomial_sample = np.random.binomial(30, 0.9, sample_size)  # 30 trials, p = 0.9
    col = f'sample {i}'
    df500[col] = binomial_sample

df500

In [None]:
# Mean of each of the 50 samples: together these means form the sampling distribution
df500_sample_means_binomial = pd.DataFrame(df500.mean(),columns=['Sample means'])
df500_sample_means_binomial

In [None]:
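# distplot is deprecated in recent seaborn releases; histplot with kde=True is its replacement
sns.histplot(data=df500_sample_means_binomial, x="Sample means", kde=True);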

In this example we assumed that the population follows a Binomial distribution with 30 trials and p=0.9. If the CLT holds, the sampling distribution of the mean should therefore be approximately normal with mean = 30 × 0.9 = 27 and standard deviation = sqrt(30 × 0.9 × 0.1) / sqrt(500) ≈ 0.0735 (**see formula above**).
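The two expected values can be worked out directly and compared with a simulation. This is only a sketch: the seed and the use of 10,000 samples (instead of the 50 drawn above) are arbitrary choices made to keep the check stable.

```python
import numpy as np

n_trials, p = 30, 0.9   # binomial parameters used in the simulation above
sample_size = 500       # observations per sample

pop_mean = n_trials * p                      # 30 * 0.9 = 27
pop_sd = np.sqrt(n_trials * p * (1 - p))     # sqrt(30 * 0.9 * 0.1)
se_mean = pop_sd / np.sqrt(sample_size)      # standard error of the sample mean

# Simulated check: one row per sample, then the mean of each row
rng = np.random.default_rng(42)
means = rng.binomial(n_trials, p, size=(10_000, sample_size)).mean(axis=1)

print(pop_mean, round(se_mean, 4))
print(means.mean(), means.std(ddof=1))
```

The mean and standard deviation of `means` should land close to `pop_mean` and `se_mean`, which is what the two cells below check for the 50 samples.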

In [None]:
# Mean of sample means is close to the population mean
df500_sample_means_binomial.mean().values[0]

In [None]:
# Standard deviation of sample means is close to population standard deviation divided by square root of sample size
df500_sample_means_binomial.std().values[0]

### Univariate histograms

A histogram is a **bar plot** where the axis representing the data variable is divided into a set of discrete bins and the count of observations falling within each bin is shown using the height of the corresponding bar:

In [None]:
sns.histplot(df, x="sepal.length");

In [None]:
penguins = sns.load_dataset("penguins")
penguins

In [None]:
sns.histplot(penguins, x="body_mass_g");

The __size of the bins__ is an important parameter, and using the wrong bin size can mislead by obscuring important features of the data or by creating apparent features out of random variability. By default, the function chooses a bin size based on the variance of the data and the number of observations.

In [None]:
sns.histplot(penguins, x="body_mass_g", binwidth=300);
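To get a feeling for how much the number of bins depends on the rule that picks it, NumPy exposes several standard rules through `np.histogram_bin_edges`. The sketch below uses synthetic data (an assumption, not the penguins dataset) and makes no claim about which rule seaborn applies internally:

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(loc=4200, scale=800, size=300)  # synthetic stand-in data

for rule in ("sturges", "fd", "auto"):
    edges = np.histogram_bin_edges(values, bins=rule)
    width = edges[1] - edges[0]
    print(f"{rule:8s} bins={len(edges) - 1:3d} width={width:.1f}")

# A fixed bin width, analogous to binwidth=300 in sns.histplot
fixed_edges = np.arange(values.min(), values.max() + 300, 300)
counts, _ = np.histogram(values, bins=fixed_edges)
print(counts.sum())  # every observation falls in exactly one bin
```

Different rules can give noticeably different bin counts for the same data, which is a good reason to try a few bin widths before interpreting the shape of a histogram.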

In [None]:
sns.histplot(df, x="sepal.length", hue="variety");

In [None]:
sns.histplot(penguins, x="body_mass_g", hue="species", element="step");

In [None]:
sns.histplot(penguins, x="body_mass_g", hue="species", multiple="stack");

In [None]:
sns.histplot(penguins, x="flipper_length_mm", hue="sex");

In [None]:
sns.histplot(penguins, x="flipper_length_mm", hue="sex", multiple="dodge");

In [None]:
sns.displot(penguins, x="flipper_length_mm", col="sex");

In [None]:
sns.displot(df, x="sepal.length", col="variety");

In [None]:
sns.displot(df, x="sepal.length", kde = True);

### Kernel density estimation
A histogram aims to approximate the underlying probability density function that generated the data by binning and counting observations. Kernel density estimation (KDE) presents a different solution to the same problem. Rather than using discrete bins, a KDE plot smooths the observations with a Gaussian kernel, producing a continuous density estimate:

In [None]:
sns.displot(penguins, x="flipper_length_mm");

In [None]:
sns.displot(penguins, x="flipper_length_mm", kind="kde");

In [None]:
# Choosing the smoothing bandwidth like with the bin size in the histogram
sns.displot(penguins, x="flipper_length_mm", kind="kde", bw_adjust=.25);

In [None]:
sns.displot(penguins, x="flipper_length_mm", kind="kde", bw_adjust=2);
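The Gaussian-kernel smoothing described above can be written out by hand: each observation contributes a small normal curve of width `h` (the bandwidth), and the density estimate is their average. The sketch below uses synthetic data and compares the manual version with `scipy.stats.gaussian_kde`; `manual_kde` is a hypothetical helper written for this example.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
obs = rng.normal(200, 14, size=100)     # synthetic stand-in for flipper lengths
grid = np.linspace(obs.min() - 30, obs.max() + 30, 400)

def manual_kde(points, grid, h):
    """Average of Gaussian bumps of width h centred on each observation."""
    z = (grid[:, None] - points[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(points) * h * np.sqrt(2 * np.pi))

kde = gaussian_kde(obs)                 # Scott's rule bandwidth by default
h = kde.factor * obs.std(ddof=1)        # the kernel width SciPy uses for 1-D data
dens_manual = manual_kde(obs, grid, h)
dens_scipy = kde(grid)

print(np.allclose(dens_manual, dens_scipy))
```

A smaller `h` follows individual observations more closely and a larger `h` smooths more, which is the effect that `bw_adjust` controls by rescaling the automatically chosen bandwidth.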

In [None]:
sns.displot(penguins, x="flipper_length_mm", hue="species");

In [None]:
sns.displot(penguins, x="flipper_length_mm", hue="species", kind="kde");

In [None]:
sns.displot(penguins, x="flipper_length_mm", hue="species", kind="kde", multiple="stack");

In [None]:
sns.displot(penguins, x="flipper_length_mm", hue="species", kind="kde", fill=True);

### Piechart

In [None]:
# Plotting the pie chart for df
df.groupby('variety').size().plot(kind='pie', textprops={'fontsize': 20},
                                  autopct='%.2f',
                                  colors=['tomato', 'gold', 'skyblue']);

In [None]:
penguins.groupby('species').size().plot(kind='pie', textprops={'fontsize': 20},
                                  autopct='%.2f',
                                  colors=['tomato', 'gold', 'skyblue']);

### Barplot

In [None]:
# Let's calculate the proportion of samples in each variety.
df.variety.value_counts(normalize=True)

In [None]:
import matplotlib.pyplot as plt
# plot a horizontal bar graph of the proportion of each variety
df.variety.value_counts(normalize=True).plot.barh()
plt.show()

In [None]:
# plot a vertical bar graph of the proportion of each variety
df.variety.value_counts(normalize=True).plot.bar()
plt.show()

In [None]:
sns.catplot(x="variety", kind="count", palette="ch:.25", data=df);

### Boxplot

In [None]:
df.describe()

In [None]:
df['sepal.length'].describe()

In [None]:
sns.catplot(y="sepal.length", kind="box", data=df);

In [None]:
df['sepal.length'].median()

In [None]:
df.groupby(["variety"])[["sepal.length"]].describe()

In [None]:
sns.catplot(x="variety", y="sepal.length", kind="box", data=df);

In [None]:
penguins

In [None]:
penguins.info()

In [None]:
# if we have more categorical variables
sns.catplot(x="species", y="body_mass_g", hue="sex", kind="box", data=penguins);

### Violin plots
Combining boxplots with KDE

In [None]:
sns.catplot(x="variety", y="sepal.length",kind="violin", data=df);

In [None]:
sns.catplot(x="species", y="body_mass_g", hue="sex",
            kind="violin", data=penguins);

In [None]:
sns.catplot(x="species", y="body_mass_g",
            col="sex",kind="box", data=penguins);

### Operations on a pandas DataFrame

In [None]:
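# mean values of the numeric columns
# (numeric_only=True skips the categorical "variety" column, which recent pandas versions refuse to average)
df.mean(numeric_only=True)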

In [None]:
# get a summary
df.describe()

In [None]:
df['sepal.length'].std()

### Empirical cumulative distribution functions (ecdf)
An ECDF represents the proportion or count of observations falling below each unique value in a dataset.
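As a minimal sketch of that definition, the ECDF can be computed by sorting the data and assigning each sorted value its rank divided by the number of observations. The values and the helper `ecdf_at` are hypothetical:

```python
import numpy as np

values = np.array([5.1, 4.9, 5.1, 6.3, 5.8])  # hypothetical measurements

xs = np.sort(values)
# Proportion of observations less than or equal to each sorted value
ecdf = np.arange(1, len(xs) + 1) / len(xs)

def ecdf_at(v, data):
    """Proportion of observations in `data` that are <= v."""
    return np.mean(np.asarray(data) <= v)

print(list(zip(xs, ecdf)))
print(ecdf_at(5.1, values))  # 3 of the 5 values are <= 5.1
```

Unlike a histogram or KDE, this needs no bin size or bandwidth, since every observation appears as its own step.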

In [None]:
sns.ecdfplot(data=df, x="sepal.length");

In [None]:
# flip the plot
sns.ecdfplot(data=df, y="sepal.length");

In [None]:
sns.ecdfplot(data=df, x="sepal.length", hue="variety");

In [None]:
#It’s also possible to plot the empirical complementary CDF (1 - CDF):
sns.ecdfplot(data=df, x="sepal.length", hue="variety", complementary = True);

### Shapiro-Wilk test for normality

We can conduct a statistical test, the Shapiro-Wilk test, to check for normality of data. It is generally regarded as one of the most powerful tests for normality.

The null hypothesis is that the data are normally distributed. If the p-value is larger than 0.05, we cannot reject the null hypothesis at the 5% significance level and we treat the data as normal; otherwise we reject the null hypothesis of normality.

We use the *shapiro* function from the scipy library:

In [None]:
df = df.dropna(how="any")

In [None]:
df

In [None]:
data = df["petal.length"].to_numpy()
data

In [None]:
# in case of normal distributions: Shapiro-Wilk test on a column of the dataset
from scipy.stats import shapiro

alpha = 0.05  # significance level
data = df["sepal.length"].to_numpy()
stat1, p1 = shapiro(data)

if p1 < alpha:
    test_result1 = 'NOT NORMAL'
    print(f'The p-value is {p1}, hence we reject the hypothesis that the data is normally distributed at the 5% significance level')
else:
    test_result1 = 'NORMAL'
    print(f'The p-value is {p1}, hence we cannot reject the hypothesis that the data is normally distributed at the 5% significance level')

In [None]:
# in case of non-normal distributions: 100 random values drawn from a Poisson distribution
from numpy.random import poisson

alpha = 0.05  # significance level
data = poisson(5, 100)
stat1, p1 = shapiro(data)

if p1 < alpha:
    test_result1 = 'NOT NORMAL'
    print(f'The p-value is {p1}, hence we reject the hypothesis that the data is normally distributed at the 5% significance level')
else:
    test_result1 = 'NORMAL'
    print(f'The p-value is {p1}, hence we cannot reject the hypothesis that the data is normally distributed at the 5% significance level')

### Kurtosis and skewness

![image.png](attachment:image.png)
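As a rough companion to the figure, both quantities can be computed from central moments. The sketch below uses the standard moment-based (biased) definitions, which may differ from the exact formulas shown in the image, and a small made-up dataset:

```python
import numpy as np
from scipy import stats

data = np.array([2.0, 3.0, 3.0, 4.0, 5.0, 9.0])  # hypothetical, right-skewed values

dev = data - data.mean()
m2 = np.mean(dev**2)   # second central moment
m3 = np.mean(dev**3)   # third central moment
m4 = np.mean(dev**4)   # fourth central moment

skew_manual = m3 / m2**1.5
kurt_manual = m4 / m2**2 - 3   # "excess" kurtosis: 0 for a normal distribution

print(skew_manual, stats.skew(data))       # bias=True is SciPy's default
print(kurt_manual, stats.kurtosis(data))   # Fisher (excess) definition by default
```

With `bias=False`, as used in the cells below, SciPy applies a small-sample correction, so the values differ slightly from these moment-based ones.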

In [None]:
from scipy import stats

In [None]:
#calculate sample skewness
data = df["sepal.length"].to_numpy()
stats.skew(data, bias=False)

In [None]:
#calculate sample kurtosis
stats.kurtosis(data, bias=False)

### Student's t-test

In [None]:
# import necessary functions
import random
from numpy.random import seed
from numpy.random import randn
from scipy.stats import ttest_ind
from scipy.stats import t

In [None]:
seed(10)
df=pd.DataFrame({"female":np.random.randint(10, 100, size=10),"male":np.random.randint(10, 140, size=10)})
print(df)

![image.png](attachment:image.png)

In [None]:
se_male=df.std()['male']/np.sqrt(10)
se_female=df.std()['female']/np.sqrt(10)
sed=np.sqrt((se_male**2) + (se_female**2))
t_stat=(df.mean()['male'] - df.mean()['female'])/sed
print(t_stat)

Now that we have the t-statistic, we have to look up the critical value in the t-distribution table. This can be a one-tailed test because we want to test whether the mean number of likes of males is greater than that of females. It would be a two-tailed test if we only wanted to test whether the means of the two populations differ.

We also need the degrees of freedom, which is **number of samples of male + number of samples of female - 2**.

In [None]:
dof=10+10-2
dof

![image.png](attachment:image.png)

As we can see in the table, the critical value for a one-tailed test with DOF=18 and a significance level of 0.05 is 1.734. Our **t-statistic was 2.94, which is higher than 1.734**, thus **we reject the null hypothesis**.
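Instead of reading the table, the critical value can also be obtained from the t distribution in SciPy. This is a small sketch; the observed statistic of 2.94 is simply the value quoted in the text, not recomputed here:

```python
from scipy.stats import t

alpha = 0.05
dof = 10 + 10 - 2

crit_one_tailed = t.ppf(1 - alpha, dof)       # upper-tail critical value
crit_two_tailed = t.ppf(1 - alpha / 2, dof)   # critical value for a two-tailed test

t_observed = 2.94                             # value quoted in the text above
p_one_tailed = t.sf(t_observed, dof)          # P(T > t_observed)

print(round(crit_one_tailed, 3), round(crit_two_tailed, 3))
print(p_one_tailed)
```

`t.ppf` is the inverse of the cumulative distribution function, so `t.ppf(0.95, 18)` returns the value that leaves 5% of the probability in the upper tail.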

In [None]:
# ttest_ind assumes equal variances and returns a two-sided p-value by default
t_stat, p = ttest_ind(df['male'], df['female'])
print(f't={t_stat}, p={p}')

![image.png](attachment:image.png)

### F-test
An F-test is used to test whether two population variances are equal. The null and alternative hypotheses for the test are as follows:

H0: σ1^2 = σ2^2 (the population variances are equal)

H1: σ1^2 ≠ σ2^2 (the population variances are not equal)

The F-test statistic is calculated as s1^2 / s2^2

In [None]:
x = [18, 19, 22, 25, 27, 28, 41, 45, 51, 55]
y = [14, 15, 15, 17, 18, 22, 25, 25, 27, 34]

In [None]:
from scipy import stats

# define F-test function
def f_test(x, y):
    x = np.array(x)
    y = np.array(y)
    f = np.var(x, ddof=1) / np.var(y, ddof=1)  # F statistic: ratio of the sample variances
    dfn = x.size - 1  # degrees of freedom of the numerator
    dfd = y.size - 1  # degrees of freedom of the denominator
    p = stats.f.sf(f, dfn, dfd)  # upper-tail (one-sided) p-value, equal to 1 - cdf
    return f, p

In [None]:
#perform F-test
f_test(x, y)

The **F test statistic is 4.38712** and the corresponding (one-sided, upper-tail) **p-value is 0.019127**. Since this p-value is less than .05, **we would reject the null hypothesis**. This means we have sufficient evidence to say that the two population variances are not equal.
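The function above returns the upper-tail probability, while the alternative hypothesis H1 is two-sided. One common convention for a two-sided p-value, shown here only as a sketch and not as the method used in this notebook, is to double the smaller of the two tail probabilities:

```python
import numpy as np
from scipy import stats

x = np.array([18, 19, 22, 25, 27, 28, 41, 45, 51, 55])
y = np.array([14, 15, 15, 17, 18, 22, 25, 25, 27, 34])

f_stat = np.var(x, ddof=1) / np.var(y, ddof=1)
dfn, dfd = x.size - 1, y.size - 1

p_upper = stats.f.sf(f_stat, dfn, dfd)              # one-sided: P(F >= f_stat)
p_lower = stats.f.cdf(f_stat, dfn, dfd)             # one-sided: P(F <= f_stat)
p_two_sided = min(1.0, 2 * min(p_upper, p_lower))   # a common two-sided convention

print(f_stat, p_upper, p_two_sided)
```

Under this convention the two-sided p-value is twice the one-sided value, so it is worth stating which version is being reported.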