1. A cleaning services company compiled the following data relating the firm's annual profit to its
annual Facebook advertising expenditure (both measured in thousands of dollars), as shown in the table below.

| Advertising Expenditure | Profit |
|------------------------|--------|
| 12                     | 60     |
| 14                     | 70     |
| 17                     | 90     |
| 21                     | 100    |
| 26                     | 100    |
| 30                     | 120    |

a) Find the best least squares fit to the data in the form of a straight line given by y = mx + c by
writing a numpy program.

b) Plot the points and the least-squares fit line using matplotlib.

c) Calculate the expected profit if the company allocates a budget of $50,000 to its next Facebook campaign. Report the value in dollars.

In [None]:
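import numpy as np
#Cannot just import matplotlib here. The plotting functions live in pyplot.
import matplotlib.pyplot as plot

#Let's start by creating arrays that represent the table (both in thousands of dollars).
adv_exp = np.array([12, 14, 17, 21, 26, 30])
profit = np.array([60, 70, 90, 100, 100, 120])

#a) A straight line y = mx + c is a degree-1 polynomial, so polyfit is called with deg = 1.
#polyfit returns the coefficients from the highest power down, i.e. [m, c].
m, c = np.polyfit(adv_exp, profit, deg = 1)
print(f"y = {m:.4f}x + {c:.4f}")

#b) Creating scatter plot with expense as x axis and profit as y.
#x runs to 50 so that the line also covers the budget asked about in part c.
x = np.linspace(0, 50, 100)
y_fit = np.polyval([m, c], x)
plot.scatter(adv_exp, profit, label = 'Data')
#This overlays the 'line of best fit'.
plot.plot(x, y_fit, label = 'Least-squares fit')
plot.xlabel('Advertising expenditure (thousands of $)')
plot.ylabel('Profit (thousands of $)')
plot.legend()
plot.show()

#c) A 50,000 budget is x = 50 in the table's units. Multiply by 1000 to report dollars.
predicted_profit = np.polyval([m, c], 50) * 1000
print(f"Predicted profit: ${predicted_profit:,.2f}")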
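
As a cross-check on the straight-line fit, the slope and intercept can also be computed from the textbook closed-form least-squares formulas. This is only a sketch that uses the table's values directly; the variable names are my own, and it assumes a budget of 50,000 corresponds to x = 50 in the table's units.

```python
import numpy as np

# Data from the table (thousands of dollars).
x = np.array([12, 14, 17, 21, 26, 30], dtype=float)
y = np.array([60, 70, 90, 100, 100, 120], dtype=float)

# Closed-form least-squares estimates for y = m*x + c:
#   m = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   c = y_mean - m * x_mean
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
c = y_mean - m * x_mean

# Predicted profit in dollars for a budget of 50 (thousand).
profit_at_50 = (m * 50 + c) * 1000

print(f"m = {m:.4f}, c = {c:.4f}")
print(f"Predicted profit: ${profit_at_50:,.2f}")
```

With the table's data this gives a slope of about 2.9675 and an intercept of about 30.6504, which puts the predicted profit near $179,024.39.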



2. Using code, open and read the contents of the files contained within the “bunchofJSONS.zip”
file. For each file read, enter the contents of the JSON file into a Pandas Dataframe. The value
of the ‘volume’ key in each JSON file is expressed in cubic mm. Update each value in
the dataframe to convert the value into cubic inches (using the .apply() or .map() method).
Calculate the total volume of all the objects in cubic inches.

In [None]:
import zipfile
import json
import pandas as pd

#1 inch = 25.4 mm, so 1 cubic inch = 25.4**3 = 16387.064 cubic mm.
MM3_PER_IN3 = 16387.064

step_one_table_list = []

#Open the .zip once and read every JSON file inside it.
with zipfile.ZipFile('bunchofJsons.zip', 'r') as bunch_of_jsons:
    #namelist() is specific to zipfile and returns the name of every member of the archive.
    file_list = bunch_of_jsons.namelist()
    #Verify that our file list is working:
    # print(file_list[:3])

    for file in file_list:
        #Skip folders and anything that isn't a .json file.
        if not file.lower().endswith('.json'):
            continue
        with bunch_of_jsons.open(file) as open_file:
            data = json.load(open_file)
        # print(file, data)

        #json_normalize flattens nested keys into dotted column names,
        #so data['compProperties']['volume'] becomes the column 'compProperties.volume'.
        #Needed to manually open a .json to find the key here.
        step_one_table_list.append(pd.json_normalize(data))

master_dataframe = pd.concat(step_one_table_list, ignore_index = True)

#Update each volume from cubic mm to cubic inches using .apply(), as the question asks.
master_dataframe['compProperties.volume'] = master_dataframe['compProperties.volume'].apply(
    lambda cubic_mm: cubic_mm / MM3_PER_IN3
)

total_vol = master_dataframe['compProperties.volume'].sum()

print(f"Total volume is: {total_vol} cubic inches")
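
To try the read-convert-sum steps without the real archive, here is a minimal sketch that builds a tiny zip in memory. The file names, the `name` key and the volume values are hypothetical; only the `compProperties` / `volume` key path comes from the cell above. The new column name `volume_in3` is also my own choice.

```python
import io
import json
import zipfile

import pandas as pd

MM3_PER_IN3 = 25.4 ** 3  # cubic millimetres in one cubic inch

# Hypothetical sample records; the real archive's contents are not shown here.
samples = {
    'part_a.json': {'name': 'part_a', 'compProperties': {'volume': 16387.064}},
    'part_b.json': {'name': 'part_b', 'compProperties': {'volume': 32774.128}},
}

# Build a small zip archive in memory so the sketch needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    for filename, record in samples.items():
        archive.writestr(filename, json.dumps(record))

# Read every member back and flatten each JSON object into a one-row dataframe.
buffer.seek(0)
with zipfile.ZipFile(buffer, 'r') as archive:
    frames = [
        pd.json_normalize(json.loads(archive.read(name)))
        for name in archive.namelist()
    ]

df = pd.concat(frames, ignore_index=True)

# Convert cubic mm to cubic inches with .apply(), then total the result.
df['volume_in3'] = df['compProperties.volume'].apply(lambda v: v / MM3_PER_IN3)
total_in3 = df['volume_in3'].sum()

print(df)
print(f"Total volume: {total_in3:.3f} cubic inches")
```

The sample volumes are chosen as one and two cubic inches, so the total should come out at about 3.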

3. Read the ‘sales.csv’ file into a pandas dataframe. Split the Location column into two additional
columns, City and State. The new dataframe should retain the original column, with the two
additional columns added.

In [None]:
import pandas as pd

#read_csv opens and closes the file itself, so no open() wrapper or csv import is needed.
sales_dataframe = pd.read_csv('sales.csv')

#Preliminary data analysis:
print(sales_dataframe.head())

#A plain assignment (new = old) would only give the same dataframe a second name,
#so adding columns through either name would change both.
#.copy() creates an independent dataframe and leaves the original untouched.
new_sales_dataframe = sales_dataframe.copy()

#Here, we split each Location value on ', ' and store the pieces in the new dataframe.
#expand = True means that when the value is split, it isn't sent as a two-item list to a single column.
#Instead, each item gets a column.
new_sales_dataframe[['City', 'State']] = sales_dataframe['Location'].str.split(', ', expand = True)

# Printing data to check.
print(new_sales_dataframe.head())
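
To see what `expand = True` changes, the split can be tried on a few made-up rows. The locations below are hypothetical stand-ins, since the real sales.csv is not reproduced here.

```python
import pandas as pd

# Hypothetical rows standing in for the Location column of sales.csv.
sales = pd.DataFrame({'Location': ['Phoenix, AZ', 'Austin, TX', 'Salt Lake City, UT']})

# Without expand, each cell holds a Python list such as ['Phoenix', 'AZ'].
as_lists = sales['Location'].str.split(', ')

# With expand=True, the pieces come back as separate columns (0 and 1).
as_columns = sales['Location'].str.split(', ', expand=True)

# Work on a copy so the original dataframe keeps its single column.
result = sales.copy()
result[['City', 'State']] = as_columns

print(as_lists)
print(result)
```

Because the split is on the two-character string ', ', a city name that contains spaces (such as 'Salt Lake City') stays in one piece.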


4. With respect to the same dataframe created above from sales.csv, answer the following
questions:
a. Which Item‐Type was sold the most?
b. Which Item‐Type generated the most revenue?
c. For items that were sold below 1000 units, which item‐type generated the most
total‐profit?
d. What item‐types were sold in the State – ‘AZ’?

In [None]:
#a, which item-type was sold most?

#.sum() and .nlargest(1) are chained to perform two operations in one line.
#.sum() totals the units sold for each item type.
#.nlargest(1) keeps only the single largest total. .nlargest(3) would keep the top three, and so on.
most_sold_item_type = sales_dataframe.groupby('Item Type')['Units Sold'].sum().nlargest(1)
print(f"The item type that was sold the most is {most_sold_item_type.index[0]} with {most_sold_item_type.values[0]} units sold.")

#b) Which item-type generated the most revenue?

#Same code as above.
highest_revenue_item_type = sales_dataframe.groupby('Item Type')['Total Revenue'].sum().nlargest(1)
print(f"The item type that generated the most revenue is {highest_revenue_item_type.index[0]} with a total revenue of ${highest_revenue_item_type.values[0]:,.2f}.")

#c) Which items generated the most profit with a unit quantity < 1000 ?

#The line below uses boolean indexing: it keeps only the rows whose 'Units Sold' value is less than 1000.
sold_below_1000 = sales_dataframe[sales_dataframe['Units Sold'] < 1000]
most_profitable_below_1000 = sold_below_1000.groupby('Item Type')['Total Profit'].sum().nlargest(1)
print(f"For items sold below 1000 units, the most profitable item type is {most_profitable_below_1000.index[0]} with a total profit of ${most_profitable_below_1000.values[0]:,.2f}.")

#d) Which items were sold in AZ?
items_sold_in_az = new_sales_dataframe.loc[new_sales_dataframe['State'] == 'AZ', 'Item Type'].unique()
print(f"The following item types were sold in Arizona: {', '.join(items_sold_in_az)}")
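
The group-then-pick-the-largest pattern used in parts a to c can be checked on a small made-up dataframe. The rows below are hypothetical; only the column names are taken from the cell above. This sketch uses `.idxmax()` as an alternative to `.nlargest(1)` when only the winning label is needed.

```python
import pandas as pd

# Hypothetical rows that reuse the column names from the sales data.
sales = pd.DataFrame({
    'Item Type': ['Snacks', 'Snacks', 'Clothes', 'Clothes', 'Fruits'],
    'Units Sold': [500, 1500, 900, 800, 2500],
    'Total Profit': [100.0, 300.0, 450.0, 400.0, 50.0],
})

# Part a pattern: total the units per item type, then take the label of the largest total.
units_by_type = sales.groupby('Item Type')['Units Sold'].sum()
top_type = units_by_type.idxmax()
top_units = units_by_type.max()

# Part c pattern: filter the rows first, then aggregate what is left.
below_1000 = sales[sales['Units Sold'] < 1000]
profit_below_1000 = below_1000.groupby('Item Type')['Total Profit'].sum()
top_profit_type = profit_below_1000.idxmax()

print(f"Most units sold: {top_type} ({top_units})")
print(f"Most profit among rows below 1000 units: {top_profit_type}")
```

Note that the filter in part c applies to individual rows, not to the per-type totals, which matches how the cell above reads the question.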



5. Given the following table, write code that performs the following set of steps.

| A        | B           | C          | D            |
|----------|-------------|------------|--------------|
| VINE     | THE         | WONDER     | PIZZA        |
| BEAUTY   | SIRE        | NUN        | NONE         |
| COOPERATION | EAST     | NOBODY     | OF           |
| NOON     | OOLONG      | THE        | UNIVERSE     |
| AIRPLANE | MY          | SUBTERFUGE | DEED         |
| NEVER    | WORLD       | RESIN      | DONOR        |
| TOO      | TWO         | CLOUD      | EVEN         |
| LIES     | SERENDIPITY | PRIZE      | SWIFT        |
| RAPID    | OBOE        | ANYBODY    | IN           |
| THE      | MULTITUDE   | SPEEDY     | MATHEMATICAL |
| PIZZAZZ  | SURE        | DIVERSITY  | RUIN         |
| RAINBOW  | WARE        | WEAR       | MOON         |
| SOMEONE  | OF          | STAR       | ABBA         |
| KAYAK    | MONOPOLY    | ITS        | EYE          |

The above table is stored as a ‘word‐table.csv’ file to be read into a dataframe. The set of operations to be performed on the table is as follows. (Note that these operations must be applied in the order displayed below; that is, the result of operation 1 feeds into operation 2, which then feeds into operation 3.)

Read the table into a Pandas Dataframe
Perform the following operations:
Replace words that appear to the immediate right of the word ‘THE’ with the value ‘NONE’

In columns B and D, replace all words that contain two or more O’s with the value ‘NONE’

In [78]:
import pandas as pd

#Place the csv into buffer here and keep an untouched copy for comparison.
control_copy_table = pd.read_csv('word-table.csv')

step_one_table = control_copy_table.copy()

print(f"{control_copy_table.head()}\n")

#First, replace words that appear to the right of THE with the value 'NONE'

#Rather than naming each column, walk over every pair of adjacent columns (A-B, B-C, C-D).
#Comparing with == matches the whole cell, so a word such as 'MATHEMATICAL',
#which merely contains the letters 'THE', does not trigger a replacement.
columns = list(step_one_table.columns)
for left, right in zip(columns[:-1], columns[1:]):
    step_one_table.loc[step_one_table[left].str.strip() == 'THE', right] = 'NONE'

print(f"{step_one_table.head()}\n")

#Then, replace all words in column B and D that contain two or more Os with 'NONE'
#An earlier attempt tested for 'O'*2, which is just the substring 'OO'.
#That only catches consecutive Os and won't help for 'MONOPOLY.'
#Counting the Os in each cell handles both cases, and looping over B and D only
#leaves columns A and C untouched.
for column in ['B', 'D']:
    two_or_more_os = step_one_table[column].str.count('O') >= 2
    step_one_table.loc[two_or_more_os, column] = 'NONE'

print(step_one_table)

             A       B           C         D
0         VINE     THE      WONDER     PIZZA
1       BEAUTY    SIRE         NUN      NONE
2  COOPERATION    EAST      NOBODY        OF
3         NOON  OOLONG         THE  UNIVERSE
4     AIRPLANE      MY  SUBTERFUGE      DEED

             A       B           C      D
0         VINE     THE        NONE  PIZZA
1       BEAUTY    SIRE         NUN   NONE
2  COOPERATION    EAST      NOBODY     OF
3         NOON  OOLONG         THE   NONE
4     AIRPLANE      MY  SUBTERFUGE   DEED
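
The difference between testing for the substring 'OO' and counting Os can be checked on a few rows taken from the table, inlined here so that nothing is read from disk. This is a sketch under my own assumptions: the rows are only a subset of the table, and `keep_default_na=False` is added so that no cell is silently read as a missing value.

```python
import io

import pandas as pd

# A subset of rows from the word table, inlined so the sketch runs without the CSV file.
csv_text = """A,B,C,D
VINE,THE,WONDER,PIZZA
BEAUTY,SIRE,NUN,NONE
COOPERATION,EAST,NOBODY,OF
NOON,OOLONG,THE,UNIVERSE
AIRPLANE,MY,SUBTERFUGE,DEED
NEVER,WORLD,RESIN,DONOR
TOO,TWO,CLOUD,EVEN
KAYAK,MONOPOLY,ITS,EYE
"""
table = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# 'OO' only matches consecutive Os; counting matches any two Os in the word.
has_double_o = table['B'].str.contains('OO')
has_two_os = table['B'].str.count('O') >= 2

# Operation: replace the cell to the immediate right of a cell equal to 'THE'.
columns = list(table.columns)
for left, right in zip(columns[:-1], columns[1:]):
    table.loc[table[left] == 'THE', right] = 'NONE'

# Operation: in columns B and D only, replace words with two or more Os.
for column in ['B', 'D']:
    table.loc[table[column].str.count('O') >= 2, column] = 'NONE'

print(table)
```

Here 'MONOPOLY' fails the 'OO' test but passes the count test, and words in column A such as 'COOPERATION' and 'NOON' are left alone because only B and D are processed.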
