
#### Importing Libraries/Packages
You can import libraries/packages at any point in a notebook, but it's a best practice to have them all imported at the top for readability and clarity.

In [1]:
# Importing Libraries/packages/dependencies 


import pandas as pd #aliasing the library so it's easier to call
import numpy as np # linear algebra library (what pandas is built on)

# Visualization libraries
import seaborn as sns 
import matplotlib.pyplot as plt # plt is the conventional alias for the pyplot module, not for matplotlib itself

In [2]:
# the zen of python

import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


The Zen of Python sums up a lot of what I find so great about the language, especially when it comes to writing scripts that are explicit and simple.  
Python uses indentation to indicate blocks of code, which I find easier to follow.  You don't have to worry about {} or ; like you do in other languages.

In [3]:
# Example of indenting
x = 12
if x > 6:
  print("X is greater than 6")
else:
  print("X is not greater than 6")

X is greater than 6


In [4]:
fruit = "orange"

if "a" in fruit:
    print("There is in fact, an *a* in orange")
else: 
    print("Surprisingly, there is not an *a* in orange")

There is in fact, an *a* in orange


In [5]:
numbers = [2,4,10,12]

for number in numbers:
  print(number*2)

4
8
20
24


In [6]:
def divide(x):
  return x / 4

print(divide(120.60))

30.15


In [7]:
# If you don't indent your blocks, you'll get thrown an error 
y = 9
if y % 2 == 0:  #     % 2 == 0 is checking to see if the value is divisible by 2
print("y is even")
else:
print("y is odd")

IndentationError: expected an indented block

#### Python datatypes

Python has 4 commonly used datatypes (there are others but we'll focus on these)

*   String (str)
*   Boolean (bool)
*   Integer (int)
*   Float (float)

In [8]:
# Python datatypes
a = "Texas"
print(type(a))

b= 50
print(type(b))

c = 10.3
print(type(c))

d = True # notice how True is a different color- that indicates that it is a keyword (you can't call a variable True)
print(type(d))

<class 'str'>
<class 'int'>
<class 'float'>
<class 'bool'>


In [9]:
# What happens when you try to assign a keyword to a variable
True = 23
print(True)

SyntaxError: cannot assign to True
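
One way to check whether a name is reserved before using it as a variable is the standard library's ```keyword``` module. A small sketch (the candidate names are arbitrary examples, not part of the notebook):

```python
import keyword

# Names to check; chosen only for illustration
candidates = ["True", "None", "fruit", "class"]

for name in candidates:
    # keyword.iskeyword() returns True for reserved words
    print(name, keyword.iskeyword(name))

# Collect the names that cannot be used as variable names
reserved = [name for name in candidates if keyword.iskeyword(name)]
print(reserved)
```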

##### Python Data Collection Types

When you're looking to store multiple values together, you have a few options to choose from.  Think of them as collections.  The type you'll use depends both on how the values are structured and on how you want to interact with them.

#### Lists
Lists are the most straightforward collection type. Key features:

*   Ordered - The order of the values *matters*. This is how we're able to call values in a list based on their index position.
*   Mutable - You can alter the contents of a list.  Sorting, applying a function, and filtering are common operations you can perform.
*   Indexable - As mentioned above, you can call a value based on its position. (Remember that Python indexing starts at 0.)
*   Allows duplicates - Because items are identified by position, a list can hold the same value more than once.
*   Heterogeneous - You can store different data types within a single list!
  

In [10]:
# Lists
my_list = [12, "apple", 4.2, True] # Note the square brackets // commas separate values // different data types

print(my_list[0])

print(my_list[-1]) # index -1 returns the last value in the list // you aren't limited to the last value: -2 is the second-to-last, and so on back to the start of the list.

my_list.append("last value") # append adds the value to the end of the list.
print(my_list)

my_list.insert(2, "grapefruit") # Insert method allows you to add an item at a specific index
print(my_list)


12
True
[12, 'apple', 4.2, True, 'last value']
[12, 'apple', 'grapefruit', 4.2, True, 'last value']
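
The sorting, applying and filtering mentioned under *Mutable* can be sketched like this. The list and its values are made up for illustration:

```python
scores = [42, 7, 19, 7]  # illustrative values; note the duplicate 7

scores.sort()                          # sorts the list in place
doubled = [s * 2 for s in scores]      # applies an operation to every item
large = [s for s in scores if s > 10]  # keeps only the items that pass a test
scores[0] = 100                        # replaces the value at index 0

print(scores)
print(doubled)
print(large)
```

The two list comprehensions build new lists, while ```.sort()``` and the index assignment change ```scores``` itself.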


#### Dictionaries

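Dictionaries are used to store key:value pairs. Each key must be unique, and you look a value up by its key rather than by its position.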
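
A minimal sketch of a dictionary, which holds key:value pairs. The keys and values below are illustrative only:

```python
# A dictionary maps unique keys to values
passenger = {"name": "Allen", "pclass": 1, "age": 29.0}

print(passenger["name"])     # look a value up by its key

passenger["fare"] = 211.34   # a new key adds a new pair
passenger["age"] = 30.0      # an existing key has its value overwritten

for key, value in passenger.items():
    print(key, "->", value)
```

Because keys are unique, assigning to ```"age"``` a second time replaces the old value instead of adding a second entry.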

#### Importing Datasets
You have a lot of options for importing data: CSV/XLSX files, JSON, APIs, etc.

For this notebook I'll be using the URL of a raw CSV file on GitHub.  In my opinion it's the easiest way to import a dataset that isn't stored locally.

The dataset we'll be importing is the *Titanic dataset*, which is a popular choice for practicing analysis, similar to the *cars* or *iris* datasets in R.


In [11]:
url = "https://raw.githubusercontent.com/rolandmueller/titanic/main/titanic3.csv" # You can name this variable whatever you want, but I use URL for clarity
df = pd.read_csv(url) # defining our dataset as df 
df.head(5) # returns the first 5 rows of the dataframe 

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


##### Cleaning and Exploring the Data

In [12]:
df.info() # you could also use df.dtypes but .info() method returns more information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


In [13]:
df.describe().round(2) # chaining the .round() method modifies the results by limiting the decimal places to the passed value -> (2)
# Returns descriptive statistics for the numeric columns in the dataframe

Unnamed: 0,pclass,survived,age,sibsp,parch,fare,body
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,2.29,0.38,29.88,0.5,0.39,33.3,160.81
std,0.84,0.49,14.41,1.04,0.87,51.76,97.7
min,1.0,0.0,0.17,0.0,0.0,0.0,1.0
25%,2.0,0.0,21.0,0.0,0.0,7.9,72.0
50%,3.0,0.0,28.0,0.0,0.0,14.45,155.0
75%,3.0,1.0,39.0,1.0,0.0,31.28,256.0
max,3.0,1.0,80.0,8.0,9.0,512.33,328.0


#### Cleaning 

Our dataset's index has no meaningful name, so let's rename it.

In [14]:
df.index = df.index.rename("passenger number")
df.sample(5) # similar to .head() or .tail(), except that it returns (x) randomly selected observations

passenger number,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
493,2,0,"Mallet, Mr. Albert",male,31.0,1,1,S.C./PARIS 2079,37.0042,,C,,,"Paris / Montreal, PQ"
1268,3,0,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5,,S,,,
182,1,1,"LeRoy, Miss. Bertha",female,30.0,0,0,PC 17761,106.425,,C,2.0,,
1154,3,0,"Rogers, Mr. William John",male,,0,0,S.C./A.4. 23567,8.05,,S,,,
925,3,0,"Kelly, Mr. James",male,44.0,0,0,363592,8.05,,S,,,


Checking the dimensions and possible null values

In [15]:
print("Dimensions of the dataframe:", df.shape)
print("___________________________________")
print(df.isnull().sum())

Dimensions of the dataframe: (1309, 14)
___________________________________
pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64


The *embarked* column values are single letters, so let's rename them to something more meaningful.

In [16]:
# first we'll need to get all the unique values for that column
print(df["embarked"].value_counts())

S    914
C    270
Q    123
Name: embarked, dtype: int64


I prefer square bracket notation to dot notation when I'm accessing a column, because dot notation doesn't work for column names that contain spaces.  
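
A quick sketch of the difference, using a toy dataframe whose column names are invented for illustration:

```python
import pandas as pd

# Toy frame; "home town" deliberately contains a space
toy = pd.DataFrame({"fare": [7.25, 71.28], "home town": ["Paris", "Boston"]})

print(toy.fare.tolist())          # dot notation works for simple names
print(toy["home town"].tolist())  # bracket notation works for any name
# toy.home town  would be a SyntaxError because of the space
```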

In [17]:
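# Historically the codes stand for Southampton, Cherbourg and Queenstown; the labels below are this notebook's own.
df["embarked"] = df["embarked"].replace(("S", "C", "Q"), value = ("Sussex", "Chelsea", "Quebec"))  # Note that the value order matters! - "S" -> "Sussex", "C" -> "Chelsea"...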
print(df["embarked"].value_counts())

Sussex     914
Chelsea    270
Quebec     123
Name: embarked, dtype: int64
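
If relying on value order feels fragile, one alternative is to pass a dictionary to ```.replace()``` so each code sits next to its label. This is a sketch on a toy Series, reusing the same labels as the cell above:

```python
import pandas as pd

codes = pd.Series(["S", "C", "Q", "S"])  # toy data standing in for df["embarked"]

# Each key is an old value, each value is its replacement
port_names = {"S": "Sussex", "C": "Chelsea", "Q": "Quebec"}

renamed = codes.replace(port_names)
print(renamed.tolist())
```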


Let's add a column into the dataframe based on a current column's values using np.where()

In [18]:
df["lived"] = np.where(df["survived"]==1, "yes", "no")  # np.where(condition, value if true, value if false)
df.head(1)

passenger number,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,lived
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,Sussex,2,,"St Louis, MO",yes


```np.where()``` breakdown


*   ```df["survived"]==1``` is the condition. It is checked for every row of the column we want to base our new column on.
*   ```"yes", "no"``` are the values to fill in: if "survived" is **1**, the corresponding value in the ```"lived"``` column is "yes"; if it isn't **1**, it'll be "no" (since they uh, died)
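
The same call on a small standalone array (toy values, not the Titanic data) shows the three arguments at work:

```python
import numpy as np

survived = np.array([1, 0, 0, 1])  # toy values

# np.where(condition, value_if_true, value_if_false), evaluated element by element
lived = np.where(survived == 1, "yes", "no")
print(lived.tolist())
```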







**Reordering columns**
> When you're reordering columns there's a choice to make: do you want the reordered dataframe to overwrite the existing dataframe, or do you want to define a new one that is separate from the original?

> This decision pertains to any instance where you're performing an operation on a previously defined variable!
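
The overwrite-or-keep choice can be sketched with a plain list. The variable names and values are illustrative only:

```python
fares = [30.0, 7.25, 151.55]  # illustrative values

# Option 1: keep the original and store the result under a new name
fares_sorted = sorted(fares)
print(fares)         # the original order is unchanged
print(fares_sorted)

# Option 2: overwrite the original by reassigning the same name
fares = sorted(fares)
print(fares)
```

Option 1 is the safer choice while exploring, because the original is still there if you need to go back to it.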




In [19]:
#  creating a new dataframe with existing columns
first_class_df = df[df["pclass"] == 1]
# we've created a new dataframe that only contains observations where the "pclass" column is equal to 1

In [20]:
first_class_df = first_class_df[["lived","name", "sex", "age", "fare", "cabin", "embarked"]] 

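Note the use of two sets of square brackets.  The outer pair selects from the dataframe, and the inner pair is a list of column names, which we need because we're calling multiple columns.  Passing a list like this applies to a wide range of operations that involve more than one value.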
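
A small sketch of what each bracket style returns, using a toy dataframe with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({"name": ["A", "B"], "age": [30, 41], "fare": [8.05, 26.0]})

one_column = toy["age"]               # a single label returns a Series
two_columns = toy[["name", "fare"]]   # a list of labels returns a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(list(two_columns.columns))      # columns come back in the order listed
```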

In [21]:
first_class_df.sort_values("fare", ascending=False).head(5) # by default, sort_values sorts in ascending order.  You can see the arguments for a function by hovering over the empty ()

passenger number,lived,name,sex,age,fare,cabin,embarked
183,yes,"Lesurer, Mr. Gustave J",male,35.0,512.3292,B101,Chelsea
50,yes,"Cardeza, Mrs. James Warburton Martinez (Charlo...",female,58.0,512.3292,B51 B53 B55,Chelsea
49,yes,"Cardeza, Mr. Thomas Drake Martinez",male,36.0,512.3292,B51 B53 B55,Chelsea
302,yes,"Ward, Miss. Anna",female,35.0,512.3292,,Chelsea
111,yes,"Fortune, Miss. Alice Elizabeth",female,24.0,263.0,C23 C25 C27,Sussex


In [22]:
wanted_columns = ["lived", "name", "sex", "age", "embarked", "home.dest", "pclass", "fare", "boat", "cabin"]

Here we've created a list of column names that we can pass to the dataframe to select and reorder its columns.  This variable can also be reused anywhere later in the notebook. I'd *strongly* recommend naming variables like this instead of typing the list inline, which is how the ```first_class_df``` columns were selected above.  

In [23]:
df = df[wanted_columns]
df.head(3)

passenger number,lived,name,sex,age,embarked,home.dest,pclass,fare,boat,cabin
0,yes,"Allen, Miss. Elisabeth Walton",female,29.0,Sussex,"St Louis, MO",1,211.3375,2.0,B5
1,yes,"Allison, Master. Hudson Trevor",male,0.9167,Sussex,"Montreal, PQ / Chesterville, ON",1,151.55,11.0,C22 C26
2,no,"Allison, Miss. Helen Loraine",female,2.0,Sussex,"Montreal, PQ / Chesterville, ON",1,151.55,,C22 C26


#### Data Selection with .iloc[] and .loc[] 

.iloc[] and .loc[] are my preferred methods for selecting data.  

iloc is integer position-based (think *i* for *integer*) 

*   specify the rows and/or columns by their respective **integer position values**

Examples


*   Single value: ```df.iloc[row_position, column_position]```



loc is label-based
  
*   specify rows and/or columns by their respective **labels**
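
To make the position-versus-label distinction concrete, here is a sketch on a toy dataframe. The index labels ```p1```, ```p2```, ```p3``` and the values are invented for illustration:

```python
import pandas as pd

toy = pd.DataFrame(
    {"name": ["Ann", "Bo", "Cy"], "fare": [8.05, 26.0, 151.55]},
    index=["p1", "p2", "p3"],  # string labels, so positions and labels differ
)

print(toy.iloc[0, 1])          # first row, second column, by position
print(toy.loc["p1", "fare"])   # the same cell, by label

# Slices behave differently: iloc excludes the end, loc includes it
print(toy.iloc[0:2, 0].tolist())
print(toy.loc["p1":"p2", "name"].tolist())
```

In the Titanic dataframe the index labels happen to be integers starting at 0, so ```df.loc[0]``` and ```df.iloc[0]``` point at the same row. That stops being true once rows are filtered or re-sorted.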








In [23]:
# One thing I like to do is isolate a sample observation so I can get a feel for how it's structured.

df.loc[0, "embarked"]

'Sussex'

#### Group by statements 

In [32]:
df_classes = df.groupby("pclass").size()
print(df_classes)
print(type(df_classes))
# Returns a series, which is a single column with x amount of observations

pclass
1    323
2    277
3    709
dtype: int64
<class 'pandas.core.series.Series'>


In [38]:
# To perform aggregations on the data, we'll first pass a list of aggregation names to .agg()
df["fare"].agg(["sum", "mean","size"]).round(2)

sum     43550.49
mean       33.30
size     1309.00
Name: fare, dtype: float64

While you can certainly use a list, it applies the same aggregations to every column it's given, so I prefer to use a dictionary instead, which lets you pick aggregations per column.  Dictionaries use key:value pairs.  Note: the **keys** have to be unique, but the **values** do not.

In [40]:
agg_dict = {"fare" :
             ["sum", "mean", "size"],
            "age" :
            ["sum", "mean"]}

            
df.groupby("pclass").agg(agg_dict).round(2)

,fare,fare,fare,age,age
pclass,sum,mean,size,sum,mean
1,28265.4,87.51,323,11121.42,39.16
2,5866.64,21.18,277,7701.25,29.51
3,9418.45,13.3,709,12433.0,24.82


In [43]:
agg_math = {"fare":
            ["sum", "mean", "median", "min", "max", "std"]
            }

In [44]:
# Because I defined the above dictionary as a variable, I can call it in this cell (or anywhere else in the notebook that will run after that cell)

df.groupby("embarked").agg(agg_math).round(1)

,fare,fare,fare,fare,fare,fare
embarked,sum,mean,median,min,max,std
Chelsea,16830.8,62.3,28.5,4.0,512.3,84.2
Quebec,1526.3,12.4,7.8,6.8,90.0,13.6
Sussex,25033.4,27.4,13.0,0.0,263.0,37.1


In [52]:
agg_age = {"age":
           ["mean", "median", "min", "max", "std"]
           }
df.groupby("embarked").agg(agg_age).round()

,age,age,age,age,age
embarked,mean,median,min,max,std
Chelsea,32.0,30.0,0.0,71.0,15.0
Quebec,29.0,26.0,2.0,70.0,15.0
Sussex,29.0,28.0,0.0,80.0,14.0


#### Pandas Pivot Table

While groupby statements chained with agg functions are useful, they do have certain limitations. If we want to easily create a multi-index table, one option is to use the ```pivot_table()``` method instead of writing a complex groupby statement.

In [57]:
pd.pivot_table(df, index = ["embarked", "sex"]).round(2)

embarked,sex,age,fare,pclass
Chelsea,female,31.22,81.13,1.65
Chelsea,male,33.28,48.81,2.0
Quebec,female,25.46,12.55,2.9
Quebec,male,31.56,12.27,2.89
Sussex,female,27.88,39.34,2.21
Sussex,male,29.94,21.84,2.41


In [59]:
# the pclass column isn't all that helpful here, so let's be explicit about which values we want

pd.pivot_table(df, index = ["embarked", "sex"], values = ["age", "fare"]).round(1)

embarked,sex,age,fare
Chelsea,female,31.2,81.1
Chelsea,male,33.3,48.8
Quebec,female,25.5,12.6
Quebec,male,31.6,12.3
Sussex,female,27.9,39.3
Sussex,male,29.9,21.8
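
For comparison, the sketch below builds the same kind of two-level table both ways on a toy dataframe (values are made up). It assumes the default behaviour of ```pivot_table()```, which aggregates with the mean:

```python
import pandas as pd

toy = pd.DataFrame({
    "embarked": ["S", "S", "S", "C", "C", "C"],
    "sex": ["female", "female", "male", "female", "male", "male"],
    "fare": [10.0, 30.0, 20.0, 30.0, 50.0, 70.0],
})

# pivot_table: index columns become the row multi-index, values are averaged
via_pivot = pd.pivot_table(toy, index=["embarked", "sex"], values=["fare"], aggfunc="mean")

# the equivalent groupby: group on both columns, then take the mean of fare
via_groupby = toy.groupby(["embarked", "sex"])[["fare"]].mean()

print(via_pivot)
print(via_groupby)
```

Both tables carry the same numbers; ```pivot_table()``` mostly saves typing once you start adding more values or aggregation functions.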
