# M297 Introductory Python for Data Science 

This notebook gives a short programming introduction to Python. Python can do far more than we cover here, so we'll focus on the concepts we need for data science.


## Variables

Variables are named placeholders for information that you want to carry through your analysis.

For example, retyping a long message or number every time you need it is tedious and error-prone, so we store it in a variable and refer to it by name instead.

Let's look at an example:
   
   

In [2]:
a = 300000000
b = 700000000


print(a)
print(b)

# What if we want to add these together? 

print(a + b)


300000000
700000000
1000000000
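Variables are not limited to numbers, and a name can be given a new value later. Here is a minimal sketch of both ideas. The names `greeting`, `speed_of_light`, `travel_time`, and `distance` are made up for illustration and are not used elsewhere in this notebook.

```python
# Variables can also hold text (strings), not just numbers
greeting = "Hello, M297!"

# Store a long number once, then reuse it by name
speed_of_light = 300000000   # meters per second (rounded), illustrative value
travel_time = 2              # seconds

distance = speed_of_light * travel_time
print(greeting)
print(distance)

# Assigning to the same name again replaces the old value
travel_time = 3
distance = speed_of_light * travel_time
print(distance)
```

Note that `distance` does not update on its own when `travel_time` changes; it has to be recalculated.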


## Data Structures: Lists and Dictionaries 

Often we don't want to handle one number at a time, but rather a whole collection of numbers at once.

For example, what if we wanted to calculate the mean $ \mu $ of a list of $ N $ numbers?

$$ \mu = \frac{1}{N} \sum_{i=1}^{N}{x_i} $$

In [1]:
# Let's generate a list of numbers from 0 to 10 

nums = [0,1,2,3,4,5,6,7,8,9,10]

# to get how many numbers are in the list, we can use a function called len() 

n_nums = len(nums) 

# to add up all the numbers in nums, we can use the sum() function 

total_nums = sum(nums) 

# the mean is just the sum divided by the length 

mean = total_nums / n_nums 

print(f"𝜇 is {mean}") 

𝜇 is 5.0
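As a quick check by hand, the list holds the 11 numbers from 0 to 10, which add up to 55, so the formula gives:

$$ \mu = \frac{0 + 1 + \dots + 10}{11} = \frac{55}{11} = 5 $$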


What if we wanted to generate a larger sequence of numbers? Using a programming concept called iteration, a for loop applies an operation to every item in a collection such as a list. We can even put the for loop inside the list brackets to build a brand-new list. In Python, this is known as a list comprehension.



In [24]:
# list comprehension example 

# creating a sequence of numbers from 0 to 49
sequenceNums = [i for i in range(50)] 
print(sequenceNums) 
len(sequenceNums) 

# What if we want to get the first number from this list? 
print(sequenceNums[0]) 
# What if we want to get the last number from this list? 
print(sequenceNums[-1]) 
# What if we want to get the first 10 numbers from this list?
print(sequenceNums[0:10])  

# What if we want to add a number to our list? 
sequenceNums.append(50) 
print(sequenceNums) 
# What if we want to remove the last number? 
# pop() removes the last item by position; remove(value) would delete the first matching value instead
sequenceNums.pop()
print(sequenceNums)


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
0
49
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
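To see how a list comprehension relates to an ordinary for loop, here is a small sketch that builds the same list both ways. The names `squares_loop`, `squares_comp`, and `evens` are made up for this example.

```python
# Build the same list two ways: with a for loop and with a list comprehension

# 1) for loop: start with an empty list and append one value per iteration
squares_loop = []
for i in range(10):
    squares_loop.append(i ** 2)

# 2) list comprehension: the same loop written inside the brackets
squares_comp = [i ** 2 for i in range(10)]

print(squares_loop)
print(squares_comp)
print(squares_loop == squares_comp)  # True when both lists match

# A comprehension can also filter items with an if clause
evens = [i for i in range(10) if i % 2 == 0]
print(evens)
```

The comprehension is shorter, but the loop version can be easier to read when each step needs several lines of work.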


In [8]:
# Generating a list of 50 random integers from 0 to 100 
import random
randNums = [random.randint(0,100) for i in range(50)]
print(randNums)
len(randNums) 

[93, 89, 88, 46, 16, 81, 46, 71, 4, 41, 25, 94, 29, 51, 32, 50, 9, 5, 84, 68, 63, 15, 53, 57, 87, 72, 99, 50, 71, 60, 55, 10, 38, 25, 96, 17, 75, 66, 83, 91, 27, 37, 43, 16, 15, 56, 27, 17, 27, 44]


50
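Because the numbers are random, the list above changes every time the cell runs. If you want the same "random" numbers on every run, one option is to set a seed first. This is a small sketch: the seed value 42 is arbitrary, and `first` and `second` are names made up for the example.

```python
import random

# Setting the same seed before each draw reproduces the same sequence
random.seed(42)
first = [random.randint(0, 100) for _ in range(5)]

random.seed(42)
second = [random.randint(0, 100) for _ in range(5)]

print(first)
print(second)
print(first == second)  # True, because both draws started from the same seed

# randint includes both endpoints, so 0 and 100 are possible values
print(all(0 <= n <= 100 for n in first))
```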

Dictionaries work like they sound. In a real dictionary, if you know a word, you can look up its meaning. In Python, the word you look up is called a key and its meaning is called the value, so a dictionary is a collection of key-value pairs. This will be really useful when we want to look up the values of particular columns in our datasets with the Pandas data analysis package.

In [28]:
# example dictionary 

myDictionary = {'birds': ['mockingbird', 'crow', 'sparrow', 'nightingale'], 'burgers': ['chicken', 'veggie', 'impossible', 'beef', 'lamb', 'fish', 'tofu']}
print(myDictionary)

# What if we just want to access the list of birds? We can look up the value stored under the key 'birds' as follows:
print(myDictionary['birds']) 
# What if we want to get 'mockingbird' from the birds list?
print(myDictionary['birds'][0]) 
# What if we want a list of all the keys in the dictionary? Use the list() function
print(list(myDictionary.keys())) 

{'birds': ['mockingbird', 'crow', 'sparrow', 'nightingale'], 'burgers': ['chicken', 'veggie', 'impossible', 'beef', 'lamb', 'fish', 'tofu']}
['mockingbird', 'crow', 'sparrow', 'nightingale']
mockingbird
['birds', 'burgers']
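Dictionaries can also be changed after they are created. The sketch below uses a small made-up dictionary called `prices` (not part of the notebook's data) to show adding a pair, looping over pairs, and safely looking up a key that might be missing.

```python
# A small hypothetical dictionary, made up for this example
prices = {'coffee': 3.5, 'tea': 2.75}

# Add a new key-value pair, or overwrite an existing one
prices['juice'] = 4.0
prices['tea'] = 3.0

# Loop over every key-value pair
for item, price in prices.items():
    print(item, price)

# .get() returns a default value instead of raising KeyError for a missing key
print(prices.get('soda', 'not on the menu'))

# Check whether a key exists
print('coffee' in prices)
```

Looking up a missing key with square brackets, such as `prices['soda']`, raises a `KeyError`, which is why `.get()` is handy when you are not sure a key is present.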


## Functions 

Functions work much like they do in math: they can take inputs (or none at all) and give you back an output. In Python, a function that doesn't explicitly return anything returns `None`. Functions are useful because they let us reuse parts of our programs for different tasks. In data science, a single function can do anything from loading the data to plotting it, which makes functions a powerful concept in programming.

In [35]:
# example of a function to calculate the mean of any collection 

def mean(nums):
    return sum(nums) / len(nums) 

# Let's test out this function on a generated list of 50 random integers from 0 to 1000 

print(mean([random.randint(0,1000) for i in range(50)]))  


443.74
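Here is a short sketch of a few more function features: a default argument, returning several values at once, and a function with no inputs. The functions `describe` and `say_hello` are made up for this example.

```python
def describe(nums, digits=2):
    """Return the minimum, maximum, and mean of nums, with the mean rounded to `digits` places."""
    average = sum(nums) / len(nums)
    return min(nums), max(nums), round(average, digits)


def say_hello():
    """Take no inputs and print a message; with no return statement, Python returns None."""
    print("Hello!")


# Unpack the three returned values into three variables
low, high, average = describe([2, 4, 9])
print(low, high, average)

# Override the default number of digits
print(describe([1, 2, 2], digits=1))

# A function without a return statement gives back None
result = say_hello()
print(result)
```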


## Numpy 

NumPy (short for Numerical Python) is used heavily for numerical calculations and goes beyond what plain lists offer. It makes it easy to store numbers in a matrix and to do operations with other matrices and vectors. It can also compute summary statistics, generate numbers, and solve systems of linear equations.

In [44]:
# let's convert a list of 10 numbers into a numpy array with 2 rows and 5 columns (a 2x5 matrix) 
import numpy as np
ten_nums = [i for i in range(10)] 
ten_array = np.array(ten_nums) 
print(ten_array)
print(ten_array.shape)
ten_matrix = ten_array.reshape(2,5) 
print(ten_matrix)
print(ten_matrix.shape)  

# let's get the number in the 4th column of the 2nd row.
# Indexing starts at 0, so that is row index 1 and column index 3.

print(ten_matrix[1, 3]) 



[0 1 2 3 4 5 6 7 8 9]
(10,)
[[0 1 2 3 4]
 [5 6 7 8 9]]
(2, 5)
8
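The operations and summary statistics mentioned above can be sketched on the same 2x5 layout. The names `matrix`, `doubled`, `col_sums`, `row_means`, `vector`, and `product` are made up for this example.

```python
import numpy as np

matrix = np.arange(10).reshape(2, 5)   # the numbers 0 to 9 in 2 rows and 5 columns

# Arithmetic applies to every element at once, with no explicit loop
doubled = matrix * 2
print(doubled)

# Summary statistics over the whole array
print(matrix.mean(), matrix.min(), matrix.max())

# axis=0 collapses the rows, giving one value per column
col_sums = matrix.sum(axis=0)
print(col_sums)

# axis=1 collapses the columns, giving one value per row
row_means = matrix.mean(axis=1)
print(row_means)

# Matrix-vector product with the @ operator (the vector needs 5 entries to match the 5 columns)
vector = np.ones(5)
product = matrix @ vector
print(product)
```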


What if we wanted to solve a system of linear equations? 

For example solving for x and y in the following:

$$ 3x + 4y = 10 $$
$$ 2x + 3y = 7 $$ 



In [51]:
# solving the above system with numpy 

# we'll keep the coefficients of x and y in one matrix and the right-hand sides, 10 and 7, in another array. 

matrix_a = np.array([3,4,2,3]).reshape(2,2)
matrix_b = np.array([10,7]) 

# use np.linalg.solve(a,b) to solve the system once, then unpack x and y 

x, y = np.linalg.solve(matrix_a, matrix_b)
print(f"x is {x:.1f} and y is {y:.1f}") 


x is 2.0 and y is 1.0 
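A good habit after solving a system is to plug the answer back in. The sketch below rebuilds the same system and checks that multiplying the coefficient matrix by the solution reproduces the right-hand side. The names `solution` and `check` are made up for this example.

```python
import numpy as np

# Coefficients and right-hand side of the system above
matrix_a = np.array([[3, 4],
                     [2, 3]])
matrix_b = np.array([10, 7])

solution = np.linalg.solve(matrix_a, matrix_b)

# Plug the solution back in: matrix_a @ solution should reproduce matrix_b
check = matrix_a @ solution
print(check)

# Compare with a tolerance, since floating-point results are rarely exact
print(np.allclose(check, matrix_b))
```

`np.allclose` is used instead of `==` because tiny rounding errors can make two floating-point numbers differ in their last digits.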


## Pandas 

Pandas is the Python package that does most of the heavy lifting when we analyze data. It is extensive, so we'll only look at a few features that come in handy for the tutorial we're going to do with the Boston Housing dataset. 

In [60]:
# Generate random data into a dictionary

panDict = {'col1': np.random.randn(50), 'col2': np.random.randn(50)} 


# Import pandas 

import pandas as pd 

# Create a dataframe 

sample = pd.DataFrame.from_dict(panDict) 

# Let's view the first five rows of the data frame 
# (in a notebook, only the last expression in a cell is displayed automatically)

sample.head() 

# Let's look at all the values in col2 

print(sample['col2'])

0    -0.534487
1     0.047679
2    -1.349423
3    -0.651031
4     1.036574
5     0.814478
6    -1.140341
7    -1.662922
8    -0.818976
9    -1.847582
10   -0.552704
11   -0.291488
12    1.033004
13   -0.052482
14    0.782225
15    0.477553
16   -0.317677
17    1.115815
18    0.067356
19   -0.352074
20   -0.425968
21   -1.087061
22    0.517688
23   -0.456893
24   -1.062250
25   -1.522507
26   -0.740599
27    0.431170
28   -0.796963
29   -0.500425
30    0.510907
31    1.059819
32   -0.773968
33   -1.573701
34   -1.008079
35    0.835911
36    0.098540
37   -0.304328
38    1.454337
39    0.604341
40    0.315380
41   -0.962313
42    0.135408
43   -1.197189
44    0.091970
45    1.262085
46   -1.848187
47    0.341207
48   -0.296334
49   -0.673495
Name: col2, dtype: float64
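To keep only the values greater than 0.5 in a column, one common approach is boolean indexing. This is a minimal sketch on a small made-up data frame called `demo`, with a fixed seed so that it prints the same numbers on every run. It does not use the `sample` data frame above.

```python
import numpy as np
import pandas as pd

# Small hypothetical data frame with a fixed seed so the example is repeatable
rng = np.random.default_rng(0)
demo = pd.DataFrame({'col1': rng.standard_normal(10),
                     'col2': rng.standard_normal(10)})

# A comparison on a column gives a boolean Series (True or False for each row)
mask = demo['col2'] > 0.5

# Indexing with the mask keeps only the rows where it is True
above_half = demo[mask]
print(above_half)

# .loc selects rows by mask and a column by name in one step
print(demo.loc[mask, 'col2'])

# Quick summary statistics for every numeric column
print(demo.describe())
```

The same pattern should work on any data frame: build a mask from a condition, then use it to select rows.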
