### Scoping My Data
From what I've been learning through Codecademy, it seems that it's best to scour over the data, and then plan out my analysis accordingly. There are questions that will immediately come to mind, but as the owner of this data, it's easy to answer these questions without really using any analysis. For example:

**On average, what month(s) incurred the most electricity costs?** This one is fairly easy to answer, even from someone who's never really seen this data. Summer months in Central California almost always use the most electricity of the year. The same can be said of the **gas** bill and the winter months.

#### About the Data 

This is pretty self-explanatory; at the end of each month, I grabbed the bill for each utilities type. I then put them in a spreadsheet, totalled them, and then divided by the number of occupants living in the household. When it comes to occupants, I *could* provide context here, but I feel that for my current needs, it will be a time-waste. There will also be instances where data does not align with what you might expect. Unfortunately, I did not document the months where I received credits on bills. Also missing is the few months in 2021 where I had a crypto mining rig eating up about $50 in electricity each month. 

With these small cases, I will be sure to provide a tidbit of information for clarity (where I can). 

Because I will be analysing this data multiple times throughout my DS journey, I think it's best that I start with simple questions. As I learn more advanced methods of analysis, I can then add those questions. For example, taking the average cost of each utility type over each year and then comparing them will be much easier than doing the same but also calculating the average cost *per occupant*, and then adjusting values based on that (that actually seems pretty fun as I'm typing it out). Doing it the easy way, unfortunately, will result in skewed data most likely. But again, this is a subjective set of data and the accuracy isn't very important. 

By the end of this, I only hope to get better awareness on where costs go each month. I will not be using this data to make "better business decisions." It will be an amusing project that will assist in my data science journey!

#### So my first question will be simple: What was the average cost of each utility *per year*? Secondly, which year had the highest cost for *each* utility? Finally, which year was the most expensive *in total* utilities?

***

First, of course, I will be needing to import the data from the csv. Initially, I'll be using the 'csv' python module. Why not use something like a pandas DataFrame? Simply, it's because I'm still very much a beginner. Eventually, I will be using the more advanced methods.

In [10]:
import csv

Next, I'll be using DictReader to grab all the data. I'll be storing in a list for the scope of this part of the project.

In [11]:
cost = []

with open('cost_distribution.csv') as data:
    reader = csv.DictReader(data)
    for row in reader:
        cost.append(row)
        #gas[row['Month']] = (row['Gas'].strip('$'))
        

More Data Stuff:

I'll need to add the year to each row, and then remove the data for 2022. This will be fun. There's probably a much more efficient way of doing this, but for practice's sake, I'll be using a function here.

I will also be stripping 2022 data rather crudely; I will remove any row with 9 items in it, as the rows with Years added to them will have 10.

### Special Thanks to [EddisFargo](https://github.com/EddisFargo) for helping me with this :) 

In [12]:
current_year = 2018
index = 0
while current_year < 2022:
    count = 0
    while count < 12:
        cost[index]["Year"] = current_year
        count += 1
        index += 1
    current_year += 1

for row in cost:
    if (len(row)) == 9:
        cost.remove(row)

There's more that could be done here, like verify the TOTALS and EACHES. Since I double checked in the spreadsheet and saw that each is a result of a calculation and not a raw input, I know these values to be true. For time's sake I will not be doing that, but later on I think it will be something I can work on. 

**So to my first question: What was the average cost of each utility per year?**

For this, I think I want to write a function, since I've not so far in this project. It will have the utility type as the parameter, and then add each year's average for that utility type to a dict, and then return that dict. I will be able to call a specific year by its key/value pairing.

In [13]:
def util_avg(utility):  
    util_avg = {}
    current_year = 2018
    index = 0
    while current_year < 2022:
        util_sum = 0
        count = 0
        while count < 12:
            util_sum += float(cost[index][utility].strip('$'))
            count += 1
            index += 1
        util_avg[current_year] = round(util_sum / 12, 2)
        current_year += 1
    return util_avg

That took longer than expected, but now I can call the function and set to variables and print out results.

In [14]:
gas_avg = util_avg("Gas")

for key, value in gas_avg.items():
    print("The average gas price in {year} was ${cost}.".format(year=key, cost=value))

The average gas price in 2018 was $29.24.
The average gas price in 2019 was $42.95.
The average gas price in 2020 was $34.26.
The average gas price in 2021 was $34.38.


This is great for finding averages for *each* utility, but for this question, I'd like to get them all in one fell swoop. However, i"m going to just switch the year and utility, so I'll be passing the year as the parameter this time. This way I only have to call the function 3 times.

### NOTE: This was not the first iteration of this function. I've spent several hours figuring out different loops, functions, parameters, etc to get ALL years/utilities averages within one function. I even tried using a list of utility types as a parameter and then getting all the averages based on that (future proofing for if I want to get averages for only certain groups of utilities). Iterating through years is much easier for me at this time. Since I won't be using this function for anything else, I'll just print the output of each average instead of returning the data in lists/dicts.

In [22]:
def util_avg(year):
    utils = ["Utility", "Gas", "Electricity", "Water", "Internet"]
    for row in cost:
        if row["Year"] == year:
            print(row)
                

  


util_avg(2019)

{'Month': 'January', 'Utility': '$55.47', 'Gas': '$85.25', 'Electricity': '$105.40', 'Water': '$0.00', 'Internet': '$59.99', 'TOTAL': '$306.11', 'Occupants': '3', 'EACH': '$102.04', 'Year': 2019}
{'Month': 'February', 'Utility': '$55.47', 'Gas': '$115.41', 'Electricity': '$106.76', 'Water': '$37.95', 'Internet': '$81.95', 'TOTAL': '$397.54', 'Occupants': '3', 'EACH': '$132.51', 'Year': 2019}
{'Month': 'March', 'Utility': '$55.47', 'Gas': '$112.98', 'Electricity': '$103.46', 'Water': '$38.02', 'Internet': '$81.95', 'TOTAL': '$391.88', 'Occupants': '3', 'EACH': '$130.63', 'Year': 2019}
{'Month': 'April', 'Utility': '$55.47', 'Gas': '$56.85', 'Electricity': '$112.17', 'Water': '$38.12', 'Internet': '$91.95', 'TOTAL': '$354.56', 'Occupants': '3', 'EACH': '$118.19', 'Year': 2019}
{'Month': 'May', 'Utility': '$55.47', 'Gas': '$21.96', 'Electricity': '$104.05', 'Water': '$19.02', 'Internet': '$81.95', 'TOTAL': '$282.45', 'Occupants': '3', 'EACH': '$94.15', 'Year': 2019}
{'Month': 'June', 'Uti

KeyError: 'Year'