## Class 2 Python for Data Science
### Python Dictionary
### List Comprehension
### Reading a CSV file and fixing data errors

One of Python's built-in datatypes is the dictionary, which defines one-to-one relationships between keys and values.

"Like lists dictionaries can easily be changed, can be shrunk and grown ad libitum at run time. They shrink and grow without the necessity of making copies. Dictionaries can be contained in lists and vice versa. But what's the difference between lists and dictionaries? Lists are ordered sets of objects, whereas dictionaries are <b>unordered sets.</b> But the main difference is that items in dictionaries are accessed via keys and not via their position." Note that this quoted description predates Python 3.7: since that release, dictionaries preserve insertion order, although items are still accessed by key rather than by position.

<br>
A pair of braces creates an empty dictionary: {}. Placing a comma-separated list of key:value pairs within the braces adds initial key: value pairs to the dictionary; this is also the way dictionaries are written on output.

In [8]:
dict1 = {"fruit" : [75,"orange"], "vegetable":"onion, mushroom, lettuce"}
dict1

{'fruit': [75, 'orange'], 'vegetable': 'onion, mushroom, lettuce'}

### i. Keys

Get the keys from "dict1"

In [2]:
dict1.keys()

dict_keys(['fruit', 'vegetable'])

### Indexing With Keys?

What happens if you try to run "<b>dict1[0]</b>"? Why?


In [9]:
dict1["fruit"]

[75, 'orange']

OR

In [33]:
dict1.get("fruit")

[75, 'orange']
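
The two lookups differ when the key is missing. The sketch below re-creates `dict1` from above; the key `"meat"` and the variable names are made up for illustration. Square brackets raise `KeyError`, which is also what `dict1[0]` does because `0` is treated as a key rather than a position. `.get()` returns `None` or a default you supply.

```python
dict1 = {"fruit": [75, "orange"], "vegetable": "onion, mushroom, lettuce"}

# Square brackets look up a key, not a position: 0 is not a key here.
try:
    dict1[0]
    outcome = "found"
except KeyError:
    outcome = "KeyError"
print(outcome)

# .get() does not raise for a missing key; it returns None or the given default.
missing = dict1.get("meat")
fallback = dict1.get("meat", "no such key")
print(missing, fallback)
```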

### ii. Values

Get the values from "dict1"

In [34]:
dict1.values()

dict_values([[75, 'orange'], 'onion, mushroom, lettuce'])

### Indexing With Values?
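This is a little more complicated: a dictionary has no built-in reverse lookup, so we loop over its items and compare each value with the one we are looking for.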

In [35]:
V = 'onion, mushroom, lettuce'

for key, value in dict1.items():
    if value == V:
        K = key
print(K)

vegetable
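
The loop above leaves `K` undefined when no value matches, and it keeps only the last match when several keys share the same value. One possible variation, using a hypothetical helper `keys_for_value`, collects every matching key into a list:

```python
dict1 = {"fruit": [75, "orange"], "vegetable": "onion, mushroom, lettuce"}

def keys_for_value(d, target):
    # hypothetical helper: return every key whose value equals `target`
    return [key for key, value in d.items() if value == target]

matches = keys_for_value(dict1, "onion, mushroom, lettuce")
no_matches = keys_for_value(dict1, "pizza")   # no such value: empty list
print(matches)
print(no_matches)
```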


### iii. Length of Dictionary

Returns the number of stored entries, i.e. the number of (key,value) pairs.

In [36]:
len(dict1)

2

### iv. Remove key and value

In [37]:
del dict1["vegetable"]
print(dict1)

{'fruit': [75, 'orange']}


### v. Add new key and value

In [38]:
dict1["new"] = 0
print(dict1)

{'fruit': [75, 'orange'], 'new': 0}


### vi. Concatenating Dictionaries
<i>*Note: Keys must be unique. If a key exists in both dictionaries, update() keeps the value from the second one.</i>

In [12]:
dict1 = {"fruit" : "orange, watermelon, grape", "vegetable":"onion, mushroom, lettuce"}
dict2 = {"fruit1": [5,6,7]}
dict1.update(dict2)
dict1

{'fruit': 'orange, watermelon, grape',
 'vegetable': 'onion, mushroom, lettuce',
 'fruit1': [5, 6, 7]}
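
A small sketch of what `update()` does when a key appears in both dictionaries. The dictionaries here are shortened, made-up examples rather than the ones above.

```python
dict1 = {"fruit": "orange", "vegetable": "onion"}
dict2 = {"fruit": "grape", "grain": "rice"}

# "fruit" exists in both, so the value from dict2 replaces the one in dict1
dict1.update(dict2)
print(dict1)
```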

### <font color = "coral">Exercise 1: Create a new dictionary</font>
<font color = "coral">Your keys should be "Country","State","City","ZipCode"

Fill in the values according to the keys.

In [2]:
#Your code here

dict1 = {
    "Country": "USA",
    "State": "CA",
    "City": "Sunnyvale",
    "ZipCode": 95051
}

print(dict1)

{'Country': 'USA', 'State': 'CA', 'City': 'Sunnyvale', 'ZipCode': 95051}


## Multi-dimensional Array

In [41]:
a = [[0,  1, 2, 3, 4, 5],
     [10,11,12,13,14,15],
     [20,21,22,23,24,25],
     [30,31,32,33,34,35],
     [40,41,42,43,44,45],
     [50,51,52,53,54,55]]

In [42]:
a[0]

[0, 1, 2, 3, 4, 5]

In [43]:
a[4:6]

[[40, 41, 42, 43, 44, 45], [50, 51, 52, 53, 54, 55]]

In [44]:
a[5][5]

55
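
`a` is a list of lists, so `a[5][5]` first selects a row and then an element inside that row. A slice such as `a[4:6]` selects whole rows; a column has to be collected element by element. A minimal sketch on a smaller, made-up grid:

```python
grid = [[0, 1, 2],
        [10, 11, 12],
        [20, 21, 22]]

row = grid[1]                      # the second row
column = [r[1] for r in grid]      # the second element of every row
print(row)
print(column)
```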

### List Comprehensions

In [13]:
bbb = [x**2 for x in range(15)]
print(bbb)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196]


This is the same as the loop below:

In [14]:
original = list(range(15))
print(original)
squares = []

for x in original:
    squares.append(x**2)
    
print(squares)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196]


#### What is happening in this loop?

In [47]:
new = []
for x in squares:
    if x < 100:
        new.append(x**2)
new

[0, 1, 16, 81, 256, 625, 1296, 2401, 4096, 6561]

In [48]:
new = [i**2 for i in squares if i < 100]
new 

[0, 1, 16, 81, 256, 625, 1296, 2401, 4096, 6561]

### <font color = "coral">Exercise 2:</font>
<font color = "coral">
Turn this for loop into a list comprehension (it should be only one line).</font>

In [49]:
mystery = []
for i in range(1000):
    if i%5 == 0:
        mystery.append(i)
print(mystery)

[0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 295, 300, 305, 310, 315, 320, 325, 330, 335, 340, 345, 350, 355, 360, 365, 370, 375, 380, 385, 390, 395, 400, 405, 410, 415, 420, 425, 430, 435, 440, 445, 450, 455, 460, 465, 470, 475, 480, 485, 490, 495, 500, 505, 510, 515, 520, 525, 530, 535, 540, 545, 550, 555, 560, 565, 570, 575, 580, 585, 590, 595, 600, 605, 610, 615, 620, 625, 630, 635, 640, 645, 650, 655, 660, 665, 670, 675, 680, 685, 690, 695, 700, 705, 710, 715, 720, 725, 730, 735, 740, 745, 750, 755, 760, 765, 770, 775, 780, 785, 790, 795, 800, 805, 810, 815, 820, 825, 830, 835, 840, 845, 850, 855, 860, 865, 870, 875, 880, 885, 890, 895, 900, 905, 910, 915, 920, 925, 930, 935, 940, 945, 950, 955, 960, 965, 970, 975, 980, 985, 990, 995]


In [4]:
#Your code here

out = [i for i in range(1000) if i%5 == 0]
print(out)

[0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 295, 300, 305, 310, 315, 320, 325, 330, 335, 340, 345, 350, 355, 360, 365, 370, 375, 380, 385, 390, 395, 400, 405, 410, 415, 420, 425, 430, 435, 440, 445, 450, 455, 460, 465, 470, 475, 480, 485, 490, 495, 500, 505, 510, 515, 520, 525, 530, 535, 540, 545, 550, 555, 560, 565, 570, 575, 580, 585, 590, 595, 600, 605, 610, 615, 620, 625, 630, 635, 640, 645, 650, 655, 660, 665, 670, 675, 680, 685, 690, 695, 700, 705, 710, 715, 720, 725, 730, 735, 740, 745, 750, 755, 760, 765, 770, 775, 780, 785, 790, 795, 800, 805, 810, 815, 820, 825, 830, 835, 840, 845, 850, 855, 860, 865, 870, 875, 880, 885, 890, 895, 900, 905, 910, 915, 920, 925, 930, 935, 940, 945, 950, 955, 960, 965, 970, 975, 980, 985, 990, 995]


<h1><b><font color = coral>&#9998; EXERCISE 3:</font></b></h1>

<font color = "coral">
Now that you know the different operators, data types, and loops, write code that removes all the unwanted information from our list.

<b>1) Create a loop that gets rid of all the odd numbers.
<br><br>
2) Put all the numbers in order from smallest to largest.<br><br>
3) Once you have a list of ordered even numbers, convert all these integers into strings.<br><br>
4) Now print your number strings as a single string with comma separation.</b></font>



In [10]:
# list for Ex 3
lst = [4,6,3,2,6,8,9,7,23,4,465,7,6,8,454,5,876,567,54,76,34,55,
       33,7653,234234,7857,23432,4353,4,345,4667,23235,1212,221,
       335,2323,21,45,76,5432,54645645,123212245346342,67,34563,2]

In [17]:
#Your code here

even_list = [i for i in lst if (i % 2)==0]
print(even_list)

ordered_even_list = sorted(even_list)
print(ordered_even_list)

str_lst = [str(i) for i in ordered_even_list]
print(str_lst)

str1 = ",".join(str_lst)
print("Joined String : " + str1)

[4, 6, 2, 6, 8, 4, 6, 8, 454, 876, 54, 76, 34, 234234, 23432, 4, 1212, 76, 5432, 123212245346342, 2]
[2, 2, 4, 4, 4, 6, 6, 6, 8, 8, 34, 54, 76, 76, 454, 876, 1212, 5432, 23432, 234234, 123212245346342]
['2', '2', '4', '4', '4', '6', '6', '6', '8', '8', '34', '54', '76', '76', '454', '876', '1212', '5432', '23432', '234234', '123212245346342']
Joined String : 2,2,4,4,4,6,6,6,8,8,34,54,76,76,454,876,1212,5432,23432,234234,123212245346342


<h1><b><font color = coral>&#9998; EXERCISE 4:</font></b></h1>

<font color = "coral">If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 10,000.

In [21]:
#Your answer here

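def sum_all_multiples_3_or_5(count: int):
    # collect every natural number below `count` that is divisible by 3 or 5
    natural = [i for i in range(1, count) if (i % 3 == 0) or (i % 5 == 0)]
    print(natural)

    # use `total` rather than `sum`, which would shadow the built-in function
    total = 0
    for i in natural:
        total += i
    return total

print("Sum = %s" % sum_all_multiples_3_or_5(10))
print("Sum = %s" % sum_all_multiples_3_or_5(10000))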

[3, 5, 6, 9]
Sum = 23
[3, 5, 6, 9, 10, 12, 15, 18, 20, 21, 24, 25, 27, 30, 33, 35, 36, 39, 40, 42, 45, 48, 50, 51, 54, 55, 57, 60, 63, 65, 66, 69, 70, 72, 75, 78, 80, 81, 84, 85, 87, 90, 93, 95, 96, 99, 100, 102, 105, 108, 110, 111, 114, 115, 117, 120, 123, 125, 126, 129, 130, 132, 135, 138, 140, 141, 144, 145, 147, 150, 153, 155, 156, 159, 160, 162, 165, 168, 170, 171, 174, 175, 177, 180, 183, 185, 186, 189, 190, 192, 195, 198, 200, 201, 204, 205, 207, 210, 213, 215, 216, 219, 220, 222, 225, 228, 230, 231, 234, 235, 237, 240, 243, 245, 246, 249, 250, 252, 255, 258, 260, 261, 264, 265, 267, 270, 273, 275, 276, 279, 280, 282, 285, 288, 290, 291, 294, 295, 297, 300, 303, 305, 306, 309, 310, 312, 315, 318, 320, 321, 324, 325, 327, 330, 333, 335, 336, 339, 340, 342, 345, 348, 350, 351, 354, 355, 357, 360, 363, 365, 366, 369, 370, 372, 375, 378, 380, 381, 384, 385, 387, 390, 393, 395, 396, 399, 400, 402, 405, 408, 410, 411, 414, 415, 417, 420, 423, 425, 426, 429, 430, 432, 435, 438, 440, 44
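
For a large limit the printed list gets long. A more compact sketch, assuming only the total is needed, sums a generator expression with the built-in `sum()`; the helper name `sum_multiples_below` is made up for illustration.

```python
def sum_multiples_below(limit: int) -> int:
    # hypothetical helper: same filter as the list comprehension, no intermediate list
    return sum(i for i in range(1, limit) if i % 3 == 0 or i % 5 == 0)

small = sum_multiples_below(10)
large = sum_multiples_below(10000)
print(small, large)
```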

<h1><b><font color = coral>&#9998; EXERCISE 5:</font></b></h1>

<font color = "coral">
Calculate all square numbers (1, 4, 9, 16, ...) below 1,000. What is their sum? Note: the squared number itself must not exceed 1,000.

In [24]:
#Your answer here
import math

sq_nums = [i for i in range(1,1001) if i==int(math.sqrt(i))*int(math.sqrt(i))]
print(sq_nums)

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400, 441, 484, 529, 576, 625, 676, 729, 784, 841, 900, 961]
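
The exercise also asks for the sum. One possible sketch builds the squares directly instead of testing every number with `math.sqrt`, then adds them up with the built-in `sum()`:

```python
# i*i stays below 1000 for i = 1..31, since 31*31 = 961 and 32*32 = 1024
sq_nums = [i * i for i in range(1, 1000) if i * i < 1000]
total = sum(sq_nums)
print(len(sq_nums), total)
```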


<h1><b><font color = coral>&#9998; EXERCISE 6:</font></b></h1>

<font color = "coral">
Write a function to calculate the mean (average) of "lst". Do not use the built-in "mean" functions that Python offers.

In [2]:
lst = [4,6,3,2,6,8,9,7,23,4,465,7,6,8,454,5,876,567,54,76,34,55]
           
#Your answer here

def mean(lst:list):
    return sum(lst)/len(lst)

print("Mean : %8.2f" % mean(lst))

Mean :   121.77


<h1><b><font color = coral>&#9998; EXERCISE 7:</font></b></h1>

<font color = "coral">
Write a function to calculate the median of "lst". Do not use the built-in "median" functions that Python offers.

In [20]:
lst = [4,6,3,2,6,8,9,7,23,4,465,7,6,8,454,5,876,567,54,76,34,55,
       33,7653,234234,7857,23432,4353,4,345,4667,23235,1212,221,
       335,2323,21,45,76,5432,54645645,123212245346342,67,34563,2]

#Your answer here

def median(lst:list):
    length = len(lst)
    nums = sorted(lst)
    print(nums)
    if length%2 == 0:
        return (nums[(length//2)-1] + nums[(length//2)])/2
    return nums[(length//2)]

print("Median : %8.2f" % median(lst))


[2, 2, 3, 4, 4, 4, 5, 6, 6, 6, 7, 7, 8, 8, 9, 21, 23, 33, 34, 45, 54, 55, 67, 76, 76, 221, 335, 345, 454, 465, 567, 876, 1212, 2323, 4353, 4667, 5432, 7653, 7857, 23235, 23432, 34563, 234234, 54645645, 123212245346342]
Median :    67.00


<h1><b><font color = coral>&#9998; EXERCISE 8:</font></b></h1>

<font color = "coral">Write a function to calculate the mode of "lst". Do not use the built-in "mode" functions that Python offers.

In [27]:
lst = [4,6,3,2,6,8,9,7,23,4,465,7,6,8,454,5,876,567,54,76,34,55,
       33,7653,234234,7857,23432,4353,4,345,4667,23235,1212,221,
       335,2323,21,45,76,5432,54645645,123212245346342,67,34563,2]

#Your answer here

def counter(lst:list):
    counts = dict()
    for i in lst:
        counts[i] = counts.get(i, 0) + 1
    return sorted(counts.items(), key=lambda x:x[1], reverse=True)

def mode(lst:list):
    m = counter(lst)
    print(m, end="\n\n")
    return m[0]

print(mode(lst))

[(4, 3), (6, 3), (2, 2), (8, 2), (7, 2), (76, 2), (3, 1), (9, 1), (23, 1), (465, 1), (454, 1), (5, 1), (876, 1), (567, 1), (54, 1), (34, 1), (55, 1), (33, 1), (7653, 1), (234234, 1), (7857, 1), (23432, 1), (4353, 1), (345, 1), (4667, 1), (23235, 1), (1212, 1), (221, 1), (335, 1), (2323, 1), (21, 1), (45, 1), (5432, 1), (54645645, 1), (123212245346342, 1), (67, 1), (34563, 1)]

(4, 3)
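
The result `(4, 3)` is a (value, count) pair, and the count table shows that 6 also occurs three times, so the list has more than one mode. A sketch of a variation that returns every tied value, assuming a non-empty list; the helper name `modes` and the short sample list are made up for illustration.

```python
lst = [4, 6, 3, 2, 6, 8, 4, 6, 4]

def modes(values):
    # hypothetical helper: return every value that reaches the highest count
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    highest = max(counts.values())   # raises ValueError on an empty list
    return sorted(v for v, c in counts.items() if c == highest)

result = modes(lst)
print(result)
```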


# Reading CSV file -- bayarea_home_prices data

In [29]:
'''
Dataset description
1) HomeID = Home ID number
2) HomeAge = Age of home in years
3) HomeSqft = Square footage of home
4) LotSize = Lot size
5) BedRooms = Number of bedrooms as per county data
6) HighSchoolAPI = API for nearest high school
7) ProxFwy = Distance in miles to freeway
8) CarGarage = Number of cars in garage; 0 = no garage
9) ZipCode = Postal zip code for the home
10) HomePriceK = Home price in $K (Target)
-------------------------------------------
9 X Variables; 1 Y variable (Target)
Data Points = 100

Data errors:
1) A few ZipCode values start with the digit 8; it should be 9
2) A few HighSchoolAPI scores have only two digits; the trailing 0 is missing
3) A few CarGarage values were entered as the letter "l"; they should be the integer 1
'''

In [30]:
## Reading csv files
def read_file(filename):
    data_array = []
    # the with-block closes the file automatically, even if an error occurs
    with open(filename, "r") as file_open:
        for line in file_open:
            if "HomeID" in line:
                continue  # skip the header row
            line_no_newline = line.rstrip()  # strip the trailing newline and whitespace
            line_split = line_no_newline.split(",")
            # print(line_split)
            data_array.append(line_split)
    return data_array

In [34]:
housing_data = read_file("bayarea_home_prices.csv")

# print(housing_data[0:6])
for i in range(0,6):
    print(housing_data[i], end="\n")

['1', '24', '1757', '6056', '2', '899', '3', '3', '94085', '894']
['2', '10', '1563', '6085', '2', '959', '4', '3', '94085', '861']
['3', '14', '1344', '6089', '2', '865', '4', '3', '94085', '831']
['4', '14', '1215', '6129', '3', '959', '4', '2', '94085', '809']
['5', '24', '1866', '6141', '3', '877', '4', '1', '94085', '890']
['6', '18', '1589', '6148', '2', '920', '3', '0', '84085', '867']


In [35]:
len_housing_data = len(housing_data)
print(len_housing_data)

100


In [36]:
list_HomeAge = []
# for all rows, extract only column 1
for k in range(0,len_housing_data):
    list_HomeAge.append(housing_data[k][1])    

In [37]:
print(list_HomeAge) 
# they are still strings, cannot do numerical calculations with strings 

['24', '10', '14', '14', '24', '18', '13', '19', '17', '24', '12', '22', '15', '25', '10', '20', '23', '16', '10', '13', '17', '10', '15', '10', '21', '12', '13', '10', '17', '24', '10', '18', '11', '19', '12', '14', '13', '22', '22', '15', '23', '21', '17', '11', '15', '11', '21', '22', '12', '19', '19', '25', '23', '12', '10', '11', '11', '19', '22', '19', '13', '19', '25', '12', '14', '25', '24', '12', '21', '16', '19', '24', '25', '17', '14', '12', '17', '25', '17', '11', '18', '19', '24', '25', '22', '19', '18', '22', '21', '14', '16', '18', '25', '21', '13', '11', '10', '21', '19', '11']
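
Because the ages are still strings, they have to be converted before any arithmetic. A minimal sketch using the first six values printed above as sample data; the variable names are made up for illustration.

```python
sample_ages = ['24', '10', '14', '14', '24', '18']  # first six values printed above

ages = [int(a) for a in sample_ages]   # convert each string to an integer
average_age = sum(ages) / len(ages)
print(ages)
print(average_age)
```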


In [38]:
# How to convert zipcodes from text to numbers
for k in range(0,len_housing_data):
    housing_data[k][8] = int(housing_data[k][8])  # convert to integer data type and over-write

In [39]:
# print(housing_data[0:5]) # Zipcode is without quotes and not strings; they are now integers
for _ in range(0,6):
    print(housing_data[_], end="\n")

['1', '24', '1757', '6056', '2', '899', '3', '3', 94085, '894']
['2', '10', '1563', '6085', '2', '959', '4', '3', 94085, '861']
['3', '14', '1344', '6089', '2', '865', '4', '3', 94085, '831']
['4', '14', '1215', '6129', '3', '959', '4', '2', 94085, '809']
['5', '24', '1866', '6141', '3', '877', '4', '1', 94085, '890']
['6', '18', '1589', '6148', '2', '920', '3', '0', 84085, '867']


In [40]:
## Reading csv files, how to fix errors in data, replace 84085 with 94085
def read_file_housing(filename):
    file_open = open(filename,"r")
    data_array = []
    for line in iter(file_open):
        if "HomeID" in line:
            continue
        line_no_newline = line.rstrip()
        line2 = line_no_newline.replace("84085","94085")
        line_split = line2.split(",")
        data_array.append(line_split)
    file_open.close()
    return data_array

In [47]:
housing_data2 = read_file_housing("bayarea_home_prices.csv")
for _ in range(0,6):
    print(housing_data2[_], end="\n")

['1', '24', '1757', '6056', '2', '899', '3', '3', '94085', '894']
['2', '10', '1563', '6085', '2', '959', '4', '3', '94085', '861']
['3', '14', '1344', '6089', '2', '865', '4', '3', '94085', '831']
['4', '14', '1215', '6129', '3', '959', '4', '2', '94085', '809']
['5', '24', '1866', '6141', '3', '877', '4', '1', '94085', '890']
['6', '18', '1589', '6148', '2', '920', '3', '0', '94085', '867']


In [44]:
len_housing_data2 = len(housing_data2)
print(len_housing_data2)

100


In [45]:
list_ZipCode2 = []
# for all rows, extract all zipcodes
for k in range(0,len_housing_data2):
    list_ZipCode2.append(int(housing_data2[k][8]))  

In [46]:
print(list_ZipCode2) # Converted to numbers

[94085, 94085, 94085, 94085, 94085, 94085, 94085, 94085, 94085, 94085, 94085, 94085, 94085, 94085, 94085, 95051, 94085, 94085, 95051, 94085, 94085, 94085, 94085, 95051, 94085, 94085, 94085, 95051, 94085, 95051, 95051, 95051, 95051, 95051, 95051, 95051, 85051, 95051, 95051, 95051, 94087, 94087, 95051, 95051, 95051, 94087, 95051, 94087, 94087, 95051, 95051, 95051, 85051, 94087, 95051, 94087, 94087, 94087, 95051, 94087, 94087, 94087, 94087, 94087, 94087, 95014, 94087, 94087, 94087, 94087, 95014, 94087, 95014, 95014, 84087, 84087, 95014, 94087, 94087, 94087, 95014, 95014, 95014, 95014, 85014, 95014, 95014, 95014, 85014, 95014, 95014, 95014, 95014, 95014, 95014, 95014, 95014, 95014, 95014, 95014]


<h1><b><font color = coral>&#9998; EXERCISE 9:</font></b></h1>

In [None]:
"""
The above example shows correcting 84085 -> 94085
Perform other zip code corrections: 
84087 -> 94087,
85014 -> 95014,
85051 -> 95051
Create a table for zip code distribution after corrections:
After:Zipcode,House_Count
94085,25
94087,25
95051,25
95014,25
"""
#Your answer here

In [56]:
import csv

In [69]:
## Data errors:
## 1) Few ZipCode have starting digit to be 8, it should be 9

## Reading csv files and handling field types & fix value errors
def read_file_housing(file_name, max_rows=0, delimiter=','):
    with open(file_name) as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=delimiter)
        records = []
        line_count = 0
        header = next(csv_reader)
        for row in csv_reader:
            
            # handle zip_code column 8
            zip_code = int(row[8]) if row[8] else 0  # csv.reader yields strings; an empty field is ''
            zip_code = zip_code + 10000 if zip_code < 90000 else zip_code
            row[8] = zip_code
            
            # save to array
            records.append(row)
            
            # increment loop and exit if required
            line_count += 1
            if max_rows != 0 and line_count >= max_rows:
                break
    return records, header

In [52]:
def pretty_print(arr, recs):
    # print the first `recs` rows, one per line
    for row in arr[:recs]:
        print(row)

In [73]:
housing_data2, header = read_file_housing("bayarea_home_prices.csv")

pretty_print(housing_data2, 10)

['1', '24', '1757', '6056', '2', '899', '3', '3', 94085, '894']
['2', '10', '1563', '6085', '2', '959', '4', '3', 94085, '861']
['3', '14', '1344', '6089', '2', '865', '4', '3', 94085, '831']
['4', '14', '1215', '6129', '3', '959', '4', '2', 94085, '809']
['5', '24', '1866', '6141', '3', '877', '4', '1', 94085, '890']
['6', '18', '1589', '6148', '2', '920', '3', '0', 94085, '867']
['7', '13', '1947', '6183', '3', '959', '3', '1', 94085, '843']
['8', '19', '1839', '6186', '3', '905', '4', '0', 94085, '820']
['9', '17', '1501', '6233', '2', '884', '3', '1', 94085, '874']
['10', '24', '1933', '6276', '2', '95', '4', '1', 94085, '885']
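
The zip code rule can be tried without the data file. The sketch below feeds `csv.reader` an in-memory sample through `io.StringIO`. The header names are assumed from the dataset description and the two rows are copied from the output above; the helper `fix_zip` is made up for illustration.

```python
import csv
import io

# Assumed header plus two sample rows in the same column layout as the data file
sample = io.StringIO(
    "HomeID,HomeAge,HomeSqft,LotSize,BedRooms,HighSchoolAPI,ProxFwy,CarGarage,ZipCode,HomePriceK\n"
    "5,24,1866,6141,3,877,4,1,94085,890\n"
    "6,18,1589,6148,2,920,3,0,84085,867\n"
)

def fix_zip(raw: str) -> int:
    # a leading 8 was typed instead of 9, so any code below 90000 is shifted up by 10000
    zip_code = int(raw)
    return zip_code + 10000 if zip_code < 90000 else zip_code

reader = csv.reader(sample)
header = next(reader)                        # consume the header row
fixed = [fix_zip(row[8]) for row in reader]  # column 8 holds the zip code
print(fixed)
```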


In [74]:
# extract one column from a 2D list by its column index
def get_column(arr, column):
    return [i[column] for i in arr]

In [77]:
def simple_counter(lst:list):
    counts = {}
    for i in lst:
        counts[i] = counts.get(i, 0) + 1
    return counts

In [155]:
zips = get_column(housing_data2, 8)
print(zips[0:5])

zips_count = simple_counter(zips)

for k,v in zips_count.items():
    print("{0:5d}, {1:3d}".format(k,v))

[94085, 94085, 94085, 94085, 94085]
94085,  25
95051,  25
94087,  25
95014,  25


In [157]:
import json

print(zips_count)
print(json.dumps(zips_count))

{94085: 25, 95051: 25, 94087: 25, 95014: 25}
{"94085": 25, "95051": 25, "94087": 25, "95014": 25}
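
The second line of output shows the zip codes in quotes: JSON object keys are always strings, so `json.dumps` converts the integer keys. A short sketch of the round trip, using a shortened copy of `zips_count`:

```python
import json

zips_count = {94085: 25, 95051: 25}

encoded = json.dumps(zips_count)     # integer keys become strings in JSON
decoded = json.loads(encoded)        # keys come back as strings
restored = {int(k): v for k, v in decoded.items()}  # convert keys back to int
print(encoded)
print(restored == zips_count)
```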


<h1><b><font color = coral>&#9998; EXERCISE 10:</font></b></h1>

In [None]:
"""
Modify function read_file_housing to multiply incorrect SchoolAPI by 10.
Assume API value to be incorrect if it is a two digit number.
Calculate average School API by zipcode. Print the following:
Average_SchoolAPI,Cnt_of_homes,ZipCode
xyz,mn,abc
"""
#Your answer here

In [92]:
## Data errors:
## 2) Few HighSchoolApi scores have two digits, the ending digit 0 is missing

## Reading csv files and handling field types & fix value errors
def read_file_housing(file_name, max_rows=0, delimiter=','):
    with open(file_name) as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=delimiter)
        records = []
        line_count = 0
        header = next(csv_reader)
        for row in csv_reader:
            # handle hs_api_score column 5
            hs_api_score = int(row[5]) if row[5] else 0  # csv.reader yields strings; an empty field is ''
            hs_api_score = hs_api_score * 10 if hs_api_score < 100 else hs_api_score
            row[5] = hs_api_score
            
            # handle zip_code column 8
            zip_code = int(row[8]) if row[8] else 0
            zip_code = zip_code + 10000 if zip_code < 90000 else zip_code
            row[8] = zip_code
            
            # save to array
            records.append(row)
            
            # increment loop and exit if required
            line_count += 1
            if max_rows != 0 and line_count >= max_rows:
                break
    return records, header

In [93]:
housing_data2, header = read_file_housing("bayarea_home_prices.csv")

pretty_print(housing_data2, 10)

['1', '24', '1757', '6056', '2', 899, '3', '3', 94085, '894']
['2', '10', '1563', '6085', '2', 959, '4', '3', 94085, '861']
['3', '14', '1344', '6089', '2', 865, '4', '3', 94085, '831']
['4', '14', '1215', '6129', '3', 959, '4', '2', 94085, '809']
['5', '24', '1866', '6141', '3', 877, '4', '1', 94085, '890']
['6', '18', '1589', '6148', '2', 920, '3', '0', 94085, '867']
['7', '13', '1947', '6183', '3', 959, '3', '1', 94085, '843']
['8', '19', '1839', '6186', '3', 905, '4', '0', 94085, '820']
['9', '17', '1501', '6233', '2', 884, '3', '1', 94085, '874']
['10', '24', '1933', '6276', '2', 950, '4', '1', 94085, '885']


In [125]:
# calculates Sum of Column1 by Column2
def get_sum_col1_by_col2(arr:list, col1:int, col2:int):
    sums = {}
    for row in arr:
        sums[row[col2]] = sums.get(row[col2], 0) + row[col1]
    return sums

# Print Avg of Dict1 by Dict2
def print_avg_dict1_by_dict2(dict1:dict, dict2:dict):
    for i in dict2.keys():
        print("{0:8.2f}, {1:3d}, {2:5d}".format((dict1[i] / dict2[i]), dict2[i], i))

In [127]:
# get sum of school api by zip_code
api_col = 5
zip_col = 8
api_sum = get_sum_col1_by_col2(housing_data2, api_col, zip_col)
print(api_sum)

# print average school api by zip_code + houses
print_avg_dict1_by_dict2(api_sum, zips_count)  # zips_count: homes per zip code, computed above

{94085: 22675, 95051: 22917, 94087: 22481, 95014: 22370}
  907.00,  25, 94085
  916.68,  25, 95051
  899.24,  25, 94087
  894.80,  25, 95014
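
The average above needs two dictionaries, one for the sums and one for the counts. A possible alternative collects both in a single pass. The helper `average_by_key` is hypothetical and the three rows are made-up (api_score, zip_code) pairs, not rows from the data file.

```python
def average_by_key(rows, value_col, key_col):
    # hypothetical helper: returns {key: (average, count)} in a single pass
    totals = {}
    for row in rows:
        total, count = totals.get(row[key_col], (0, 0))
        totals[row[key_col]] = (total + row[value_col], count + 1)
    return {k: (total / count, count) for k, (total, count) in totals.items()}

# made-up rows for illustration: (api_score, zip_code)
rows = [(899, 94085), (959, 94085), (900, 95051)]
averages = average_by_key(rows, 0, 1)
print(averages)
```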


<h1><b><font color = coral>&#9998; EXERCISE 11:</font></b></h1>

In [None]:
"""
Modify function read_file_housing to replace CarGarage value of 'l' with integer 1
Calculate and print the following:
Car_Garage,Cnt_of_homes
0,m
1,n
2,o
3,p
"""

In [129]:
## Data errors:
## 3) Few CarGarage numbers were entered as letter "l", it should be integer 1 

## Reading csv files and handling field types & fix value errors
def read_file_housing(file_name, max_rows=0, delimiter=','):
    with open(file_name) as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=delimiter)
        records = []
        line_count = 0
        header = next(csv_reader)
        for row in csv_reader:
            # handle hs_api_score column 5 (csv.reader yields strings, so test for an empty field)
            hs_api_score = int(row[5]) if row[5] else 0
            hs_api_score = hs_api_score * 10 if hs_api_score < 100 else hs_api_score
            row[5] = hs_api_score
            
            # handle car_garage column 7
            row[7] = int(row[7]) if row[7] != 'l' else 1
            
            # handle zip_code column 8
            zip_code = int(row[8]) if row[8] else 0
            zip_code = zip_code + 10000 if zip_code < 90000 else zip_code
            row[8] = zip_code
            
            # handle house_price column 9
            row[9] = int(row[9]) if row[9] else 0
            
            # save to array
            records.append(row)
            
            # increment loop and exit if required
            line_count += 1
            if max_rows != 0 and line_count >= max_rows:
                break
    return records, header

In [130]:
housing_data2, header = read_file_housing("bayarea_home_prices.csv")

pretty_print(housing_data2, 10)

['1', '24', '1757', '6056', '2', 899, '3', 3, 94085, 894]
['2', '10', '1563', '6085', '2', 959, '4', 3, 94085, 861]
['3', '14', '1344', '6089', '2', 865, '4', 3, 94085, 831]
['4', '14', '1215', '6129', '3', 959, '4', 2, 94085, 809]
['5', '24', '1866', '6141', '3', 877, '4', 1, 94085, 890]
['6', '18', '1589', '6148', '2', 920, '3', 0, 94085, 867]
['7', '13', '1947', '6183', '3', 959, '3', 1, 94085, 843]
['8', '19', '1839', '6186', '3', 905, '4', 0, 94085, 820]
['9', '17', '1501', '6233', '2', 884, '3', 1, 94085, 874]
['10', '24', '1933', '6276', '2', 950, '4', 1, 94085, 885]


In [124]:
cars = simple_counter(get_column(housing_data2, 7))

for k,v in sorted(cars.items()):
    print("{0:5d}, {1:3d}".format(k,v))

    0,  31
    1,  18
    2,  19
    3,  32


<h1><b><font color = coral>&#9998; EXERCISE 12:</font></b></h1>

In [None]:
"""
Find the average price of a home in each of these four zip codes.
Zipcode,Avg_Price,Cnt_of_homes
"""

In [131]:
# get sum of house prices by zip_code
price_col = 9
zip_col = 8
prices_sum = get_sum_col1_by_col2(housing_data2, price_col, zip_col)
print(prices_sum)

# print average house prices by zip_code + houses
print_avg_dict1_by_dict2(prices_sum, zips_count)  # zips_count: homes per zip code, computed above

{94085: 22149, 95051: 25580, 94087: 28787, 95014: 31583}
  885.96,  25, 94085
 1023.20,  25, 95051
 1151.48,  25, 94087
 1263.32,  25, 95014


<h1><b><font color = coral>&#9998; EXERCISE 13:</font></b></h1>

In [None]:
"""
Find the average price of a home in Sunnyvale (94087 and 94085).
Print the output as follows:
The average house price in Sunnyvale based on xx homes is $yyy (thousands).
"""

In [153]:
sv_count = zips_count[94085] + zips_count[94087]  # zips_count: homes per zip code, computed above
sv_avg_price = ( prices_sum[94085] + prices_sum[94087] ) / sv_count * 1000
print("The average house price in Sunnyvale based on {0:2d} homes is ${1:12,.2f}.".format( sv_count, sv_avg_price ))

The average house price in Sunnyvale based on 50 homes is $1,018,720.00.
