# Python Introduction - 09/05/22

Prep: Create a virtual environment in Anaconda (Anaconda Prompt or terminal, depending on your operating system) with `conda create --name environment_name`. To create an environment with a specific Python version, run e.g. `conda create --name environment_name python=3.4`. You can list all your environments with `conda info --envs`. Activate your environment with `conda activate environment_name`. To install packages into the active environment, run `conda install numpy pandas matplotlib seaborn scikit-learn spyder notebook`, adding any other packages you want. You can also install packages directly when creating the environment and pin the versions you want to work with, e.g. `conda create -n environment_name scipy=0.15.0` (`-n` is just the short form of `--name`). `conda list -n environment_name` lists all packages installed in the named environment (plain `conda list` does the same for the active one). Deactivate your environment with `conda deactivate`.\
Hint: sometimes you will have to install packages with `pip` instead of `conda`. It can be useful to install `pip` into your base environment by running `conda install pip` (for Windows users: you might have to open the Anaconda Prompt as an administrator to be able to install into base, but please look up what running the prompt as an administrator means before you do so).\
For more information on conda, see https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf.

To open a notebook, either launch Jupyter Notebook for your environment directly from Anaconda, or activate your environment (`conda activate environment_name`) and run `jupyter notebook` in the Prompt/terminal.

### 1. Basics: Lists, sets, tuples, dictionaries

#### 1.1 Lists: can contain any type of object; mutable (can be changed)

In [None]:
list1 = [1,2,3]
list1

In [None]:
list2 = [1.3, 6.8, "house", [1, 6, "street"]]
list2[1] #Python starts at 0

In [None]:
list2[3][0] #access list in list

In [None]:
list2[-3] = "street" #mutable: list can be changed after creation
list2

In [None]:
list3 = list([1,2,4]) #can also be created with list()
list3
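
Because lists are mutable, they can also grow and shrink after creation. The following is a small sketch of a few common list operations; the variable names are made up for illustration.

```python
numbers = [1, 2, 3]

numbers.append(4)         # add one element at the end
numbers.extend([5, 6])    # add several elements at once
first_two = numbers[:2]   # slicing returns a new list with positions 0 and 1
last = numbers[-1]        # negative indices count from the end
numbers.remove(3)         # remove the first occurrence of the value 3

print(numbers, first_two, last)
```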

#### 1.2 Sets: mutable, no order  

In [None]:
set1 = {"cat", 1, "dog"}
set1

In [None]:
#no indexing because sets have no order; change them with add() and update() instead
set1.add(2)
set1

In [None]:
set1.update(["fish", 3])
set1
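
Two further properties of sets are worth seeing in action: they store every element only once, and they support fast membership tests and set operations. A minimal sketch with made-up example values:

```python
# Duplicates are dropped silently: "cat" is stored only once
animals = {"cat", "dog", "cat", "fish"}
n_unique = len(animals)

# Membership tests work without indexing
has_dog = "dog" in animals

# Set operations: union (|) and intersection (&)
others = {"dog", "mouse"}
all_animals = animals | others
shared = animals & others

# sorted() is used only to print in a stable order, since sets are unordered
print(n_unique, has_dog, sorted(all_animals), shared)
```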

#### 1.3 Tuples: immutable (can't be changed after creation; faster to work with than lists)

In [None]:
pets = "cat", "dog", "fish", "mouse"
pets2 = ("cat", "dog", "fish", "mouse")
pets

In [None]:
tuple1 = ("cat", 2, 4)
#tuple1[0] = 3 #gives error because tuple is immutable

In [None]:
tuple2 = ("cats") * 3 #without a trailing comma this is just a string, so the string is repeated
tuple2

In [None]:
tuple2 = ("cats", ) * 3 #the trailing comma makes it a one-element tuple, which is repeated three times
tuple2

In [None]:
#indexing works because there's an order
pets[1:] #position 1 to end
#pets[:] #everything
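
To illustrate immutability without commenting out the failing line, the sketch below catches the error that Python raises on assignment, and also shows tuple unpacking. The names `point`, `x` and `y` are only for illustration.

```python
point = (3, 4)

# Unpacking assigns each element of the tuple to its own variable
x, y = point

# Assigning to a position raises a TypeError because tuples are immutable
try:
    point[0] = 10
except TypeError as err:
    error_message = str(err)

print(x, y, error_message)
```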

#### 1.4 Dictionaries: key-value pairs

In [None]:
dict1 = {1: 'cat', 2: 'dog'}
dict1[1] #call key 1

In [None]:
dict2 = {'name': 'John', 1: [2, 4, 3]} #possible to have lists or strings as a value
dict2['name']

In [None]:
dict2[3] = 'new_value' #mutable
dict2

In [None]:
dict3 = dict({1:'apple', 2:'ball'}) #can also create dictionary with dict()
dict3[1]
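
In practice you will often loop over a dictionary or look up a key that may not exist. A short sketch, using a hypothetical `pet_ages` dictionary:

```python
pet_ages = {"cat": 3, "dog": 5}

# items() yields (key, value) pairs
for name, age in pet_ages.items():
    print(name, age)

# get() returns a default instead of raising a KeyError for a missing key
fish_age = pet_ages.get("fish", 0)

# "in" checks whether a key exists
has_cat = "cat" in pet_ages

# Dictionaries keep insertion order (Python 3.7+)
keys = list(pet_ages.keys())
```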

### 2. Basics: Functions, loops, if statements

#### 2.1 Loops

In [None]:
for i in range(1, 10):
    a = i*2
    print(a)
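
If you want to keep the results of a loop rather than just print them, you can collect them in a list. The sketch below shows the explicit loop and the equivalent list comprehension, plus `enumerate` for getting the position together with the value.

```python
# Explicit loop that collects the doubled values
doubled = []
for i in range(1, 10):   # range(1, 10) runs from 1 to 9; the end point is excluded
    doubled.append(i * 2)

# The same result as a list comprehension
doubled_short = [i * 2 for i in range(1, 10)]

# enumerate() yields (position, value) pairs
for position, value in enumerate(doubled_short[:3]):
    print(position, value)
```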

#### 2.2 Functions

In [None]:
def first_function(a):
    b = a + 1
    c = a + b
    d = c/a
    
    return b, c, d #you can return multiple outputs in Python (they are packed into a tuple)

In [None]:
b, c, d = first_function(a = 3)
print(b, c, d)

In [None]:
b = first_function(a = 3)[0] #only get first returned output
print(b)
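
Functions can also define default values for their arguments, so callers only pass what they want to change. A minimal sketch with a hypothetical helper `scale`:

```python
def scale(value, factor=2):
    """Return value multiplied by factor (factor defaults to 2)."""
    return value * factor

default_result = scale(5)            # uses the default factor
custom_result = scale(5, factor=3)   # overrides the default by keyword

print(default_result, custom_result)
```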

#### 2.3 If/elif/else statements

In [None]:
a = 12
b = 10

if (a > b):
    print("a is greater than b")
elif (a == b):
    print("a equals b")
else:
    print("b is greater than a")

### 3. Important packages

We will focus on:
- **pandas:** reading/writing data from/to different file formats; for data exploration/cleaning/manipulation
- **NumPy:** works with (multidimensional) arrays (faster than lists); for numerical operations
- **Matplotlib:** data visualisation 

\
More resources on useful packages: an overview is available at https://learnpython.com/blog/most-popular-python-packages/, and a list with links to learning material for each package at https://www.kdnuggets.com/2021/03/top-10-python-libraries-2021.html

#### Load packages

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#### 3.1 NumPy

In [None]:
a = np.arange(15) #creates an array from 0-14 (default step size: 1)
a

In [None]:
a = np.reshape(a, (3, 5)) #change dimensions to 3x5
a

In [None]:
b = np.arange(15).reshape(3, 5) #can also do it all in one step
b

In [None]:
c = np.arange(0, 15, 0.5) #change step size; start and end point must then be given too (end point is excluded)
c 

In [None]:
d = np.linspace(0, 15, 10000) #creates an array using number of steps rather than step size
d

In [None]:
e = np.zeros(5)
f = np.ones(5)
g = np.empty_like(c) #uninitialised array with the same shape and dtype as c; values are arbitrary until you fill them
print(e, f, g)
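
The main reason to use arrays instead of lists is that operations apply to all elements at once, without writing a loop. The following sketch shows element-wise arithmetic, reductions along an axis, and boolean masking on a small example array.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

doubled = a * 2                  # element-wise multiplication, no loop needed
col_sums = a.sum(axis=0)         # sum down the rows: one value per column
row_means = a.mean(axis=1)       # mean across the columns: one value per row

mask = a > 2                     # boolean array with the same shape as a
selected = a[mask]               # 1-D array of the elements where mask is True

print(doubled, col_sums, row_means, selected, sep="\n")
```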

#### 3.2 pandas

#### Create objects

In [None]:
series1 = pd.Series([1, 3, 5, np.nan, 6, 8])
series1

In [None]:
dates1 = pd.date_range("20210101", periods=6)
dates1

In [None]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates1, columns=list("ABCD"))
df

In [None]:
dict1 = {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
df2 = pd.DataFrame(dict1)
df2

In [None]:
df2.sort_values(by="B")

In [None]:
df2["B"]

In [None]:
df.loc["20210101", "B"]

In [None]:
df.iloc[0, 1]
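
The difference between `loc` (label-based) and `iloc` (position-based) is easier to see on a small table with known values. The sketch below uses a made-up `scores` DataFrame and also shows filtering rows with a boolean condition.

```python
import pandas as pd

scores = pd.DataFrame(
    {"math": [90, 75, 60], "art": [70, 85, 95]},
    index=["ann", "bob", "cat"],
)

by_label = scores.loc["bob", "art"]    # row label "bob", column label "art"
by_position = scores.iloc[1, 1]        # second row, second column (same cell)

# Boolean indexing keeps only the rows where the condition is True
high_math = scores[scores["math"] > 70]

print(by_label, by_position)
print(high_math)
```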

#### Load files

In [None]:
import os #os package: operating system interface (lets you create/modify/remove directories, among other things)
os.getcwd()

In [None]:
os.chdir('C:/Users/sandr/Documents/python_intro') #replace with the folder on your machine that contains Titanic.csv
os.getcwd()

In [None]:
titanic = pd.read_csv('./Titanic.csv') 

In [None]:
titanic.head()

In [None]:
titanic.tail()

In [None]:
freq_class = pd.DataFrame(titanic["PClass"].value_counts())
print(freq_class)

In [None]:
na_titanic = pd.DataFrame(titanic.isna().sum())
print(na_titanic)

In [None]:
titanic_drop = titanic.dropna(axis = 1)
na_titanic = pd.DataFrame(titanic_drop.isna().sum())
print(na_titanic)

In [None]:
titanic = titanic.dropna(axis = 0)
na_titanic = pd.DataFrame(titanic.isna().sum())
print(na_titanic)
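
The effect of the `axis` argument of `dropna` is easy to mix up, so here is a sketch on a tiny made-up table: `axis=1` removes every column that contains a missing value, `axis=0` removes every row that contains one. Filling missing values with `fillna` is shown as one possible alternative to dropping data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "age": [30.0, np.nan, 25.0],
    "sex": ["f", "m", "f"],
})

missing_per_column = toy.isna().sum()   # number of missing values in each column

without_columns = toy.dropna(axis=1)    # drops the "age" column, keeps all rows
without_rows = toy.dropna(axis=0)       # drops the row with the missing age

# Alternative: replace the missing age with the mean of the known ages
filled = toy.fillna({"age": toy["age"].mean()})

print(without_columns.shape, without_rows.shape)
```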

In [None]:
titanic.hist("Age")

In [None]:
titanic = pd.DataFrame(titanic)

More material on pandas: https://pandas.pydata.org/docs/user_guide/dsintro.html

#### 3.3 Matplotlib

In [None]:
ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
ts.head()

In [None]:
ts = ts.cumsum()
ts.head()

In [None]:
ts.plot(); #outside of notebook: use plt.show() at the end to see plot

In [None]:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"])
df = df.cumsum()

In [None]:
plt.figure();
df.plot();

In [None]:
ts1 = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
ts2 = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))

In [None]:
fig = plt.figure(figsize = (12, 4))

#plt.subplot(121)
plt.plot(ts1)
plt.plot(ts2)
plt.xlabel('time (a.u.)')
plt.ylabel('random value')
plt.title("Time series")
plt.legend(('time series 1', 'time series 2'), loc = 1) #loc = 1 places the legend in the upper right corner
fig.savefig('myfirstplot.png', dpi = 100)

In [None]:
fig = plt.figure(figsize = (17, 4))

plt.subplot(121)
plt.plot(ts1)
plt.xlabel('time (a.u.)')
plt.ylabel('random value')
plt.title('Time series 1')

In [None]:
fig = plt.figure(figsize = (17, 4))

plt.subplot(121)
plt.plot(ts1)
plt.xlabel('time (a.u.)')
plt.ylabel('random value')
plt.title('Time series 1')

plt.subplot(122)
plt.plot(ts2)
plt.xlabel('time (a.u.)')
plt.ylabel('random value')
plt.title('Time series 2')
fig.savefig('mysecondplot.png', dpi = 100)

In [None]:
plt.style.use("ggplot")

In [None]:
fig = plt.figure(figsize = (17, 4))

plt.subplot(121)
plt.plot(ts1)
plt.xlabel('time (a.u.)')
plt.ylabel('random value')
plt.title('Time series 1')

plt.subplot(122)
plt.plot(ts2)
plt.xlabel('time (a.u.)')
plt.ylabel('random value')
plt.title('Time series 2')
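
The subplot cells above repeat the same labelling code for each panel. As a sketch of an alternative, Matplotlib's `plt.subplots` returns the figure together with an array of axes, so the panels can be filled in a loop. The data here is freshly generated random data, not the `ts1`/`ts2` series from above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
index = pd.date_range("1/1/2000", periods=100)
series = {
    "Time series 1": pd.Series(rng.standard_normal(100), index=index),
    "Time series 2": pd.Series(rng.standard_normal(100), index=index),
}

# One row, two columns of panels; axes is an array with one entry per panel
fig, axes = plt.subplots(1, 2, figsize=(17, 4))

for ax, (title, ts) in zip(axes, series.items()):
    ax.plot(ts)
    ax.set_xlabel("time (a.u.)")
    ax.set_ylabel("random value")
    ax.set_title(title)

fig.tight_layout()   # avoid overlapping labels between the panels
```

Outside of a notebook, call `plt.show()` or `fig.savefig(...)` afterwards to see or store the figure.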

In [None]:
# Age Histogram
plt.hist(titanic.Age)
plt.axvline(titanic.Age.mean(), color='k', linestyle='dashed', linewidth=1)
plt.title('Ages of Passengers on Titanic')
plt.ylabel('Count')
plt.xlabel('Age (years)')

In [None]:
print(round(titanic[['PClass','Survived']].groupby(['PClass']).mean()*100,1))

In [None]:
print(round(titanic[['Sex', 'PClass','Survived']].groupby(['PClass', 'Sex']).mean()*100,1))
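
Taking the mean of a 0/1 column gives the share of ones, which is why the group means above can be read as survival rates in percent. A sketch on a small made-up table (not the Titanic data):

```python
import pandas as pd

toy = pd.DataFrame({
    "group": ["x", "x", "y", "y", "y"],
    "survived": [1, 0, 1, 1, 0],
})

# Mean of a 0/1 column per group = fraction of ones in that group
rates = toy.groupby("group")["survived"].mean() * 100

print(rates.round(1))
```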

In [None]:
import seaborn as sns

In [None]:
sns.set(font_scale=1)
fig = sns.catplot(x="Sex", y="Survived", col="PClass", #catplot: former factorplot
                    data=titanic, saturation=.5,
                    kind="bar", ci=None, aspect=.6)

(fig.set_axis_labels("", "Survival Rate")
    .set_xticklabels(["Female", "Male"])
    .set_titles("{col_name} {col_var}")
    .set(ylim=(0, 1))
    .despine(left=True))  
plt.subplots_adjust(top=0.8)
fig.fig.suptitle('Survivors by Passenger Class');
fig.savefig('survivors.png', dpi = 100)

#### 3.4 Extra: Intro to scikit-learn (most important package for machine learning)

In [None]:
from sklearn.decomposition import PCA
from sklearn import preprocessing

In [None]:
titanic.shape

In [None]:
titanic #don't need names; also: sex and class are difficult to work with the way they are

In [None]:
titanic = titanic.drop(['Name'],axis=1)

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
for label in ['PClass','Sex']:
    titanic[label] = LabelEncoder().fit_transform(titanic[label])
titanic
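
To see what `LabelEncoder` does, the sketch below applies it to a short made-up list. It assigns integer codes to the sorted unique labels; if you keep the fitted encoder in a variable, you can also translate the codes back.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["male", "female", "female", "male"])

# classes_ holds the sorted unique labels; a label's position is its code
print(encoder.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original labels
original = encoder.inverse_transform(codes)
```

Note that the loop above creates a new encoder for each column and does not store it, so the mapping back to the original labels is not kept there.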

In [None]:
features = titanic.drop(['Survived'], axis=1)

Perform a feature transformation in the form of a principal component analysis (PCA). Usually you need to normalize the data first, among other steps; this is only meant to briefly show what scikit-learn can do.

In [None]:
pca = PCA(n_components = 0.9)
pca.fit(features)
reduced_pca = pca.transform(features)
reduced_pca.shape

With `n_components = 0.9`, PCA keeps as many components as are needed to explain at least 90% of the variance in the data; here, a single principal component was enough.
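
Why normalization matters for PCA can be sketched on synthetic data (an assumption for illustration, not the Titanic data): when one column has a much larger scale than the others, it dominates the variance. `StandardScaler` from scikit-learn is one common way to put all columns on the same scale first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three independent columns; the first has a much larger spread than the others
X = np.column_stack([
    rng.normal(0, 100, 200),
    rng.normal(0, 1, 200),
    rng.normal(0, 1, 200),
])

# Without scaling, the large-scale column carries almost all of the variance
pca_raw = PCA(n_components=0.9).fit(X)

# After standardising, every column has unit variance and contributes equally
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=0.9).fit(X_scaled)

print(pca_raw.n_components_, pca_scaled.n_components_)
```

In this sketch the unscaled data needs far fewer components to reach 90% than the scaled data, purely because of the difference in units.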

#### Other useful packages: 
- scipy (statistics, optimization, linear algebra etc.);
- seaborn (pretty plots);
- sympy (integration, differentiation, symbolic mathematics)