# Phase 1
---

Deliverables
---
1. Problem Statement: Form a title and problem statement that clearly state the problem
and the questions you are trying to answer. Additionally:
    - Discuss the background of the problem leading to your objectives. Why is it a significant problem?
    - Explain the potential of your project to contribute to your problem domain. Discuss why this contribution is crucial.
2. Data Sources: Collect your data. Your data can come from multiple sources.
    - For example: medical, banking, sports, health, Kaggle, Amazon reviews, Twitter, YouTube, Reddit, etc.
    - The data has to be large enough for the analysis to yield significant results: at least 2000 rows.
3. Data Cleaning/Processing: Your dataset has to be cleaned and properly
processed. Please submit a report where you explain each processing/cleaning step
properly. We expect to see comments and markup for this step. 
    - In order to get full marks you must clearly document 7 (10 for 587 students) distinct processing/cleaning operations.
4. Exploratory Data Analysis (EDA): Perform exploratory data analysis as
defined in the NIST publication [2] and as originally described by John Tukey [4,5].
Record the outcomes, what you learned, and how you will use this information.
    - For example, in choosing features (columns), dropping columns, and, in short, feature engineering.
    - You need to demonstrate 7 (10 for 587 students) different, significant and relevant EDA operations and describe how you used these to process the data sets further to provision them for downstream modeling and analytics. Figures and tables should be included where relevant.
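
One way to keep the cleaning operations countable and documented is to run them as named steps and record the row count before and after each one. The sketch below is only an illustration on a toy frame: `apply_steps`, the step names, and the toy values are hypothetical and are not part of the assignment or of the project data.

```python
import pandas as pd


def apply_steps(df, steps):
    """Apply (name, function) cleaning steps in order and log their effect."""
    log = []
    for name, func in steps:
        rows_before = len(df)
        df = func(df)
        log.append({"step": name, "rows_before": rows_before, "rows_after": len(df)})
    return df, pd.DataFrame(log)


# Toy data, invented purely for illustration (not taken from any project dataset).
toy = pd.DataFrame({
    "Borough": ["BRONX", "bronx ", None, "BROOKLYN", "BROOKLYN"],
    "Current Charges": [100.0, 250.0, 80.0, None, 120.0],
})

steps = [
    ("drop rows with missing values", lambda d: d.dropna()),
    ("normalize borough labels",
     lambda d: d.assign(Borough=d["Borough"].str.strip().str.upper())),
    ("drop duplicate rows", lambda d: d.drop_duplicates()),
]

cleaned, cleaning_log = apply_steps(toy, steps)
print(cleaning_log)
```

The resulting log table can be pasted into the report next to the written explanation of each step, which makes it easy to check that the required number of distinct operations is covered.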


Topics
---
- Utility Cost in NYC
    - Motivation: Reduce living expenses, emissions
    - Problem Statement: We want to identify the factors that most effectively reduce utility costs
    - Factors: 
        - Wealth (per house)
        - Age of household
        - Household size
        - Location
        - Proximity (Density per house)
        - Season/Date => Classes
        - Pets (per house)
    - [Electric Consumption And Cost (2010 - Feb 2022)](https://data.cityofnewyork.us/Housing-Development/Electric-Consumption-And-Cost-2010-Feb-2022-/jr24-e7cr)
    - [Heating Gas Consumption And Cost (2010 - Feb 2022)](https://data.cityofnewyork.us/Housing-Development/Heating-Gas-Consumption-And-Cost-2010-Feb-2022-/it56-eyq4)
    - [NYC Clean Heat Dataset (Historical)](https://data.cityofnewyork.us/City-Government/NYC-Clean-Heat-Dataset-Historical-/8isn-pgv3)
    - [Natural Gas Consumption by ZIP Code - 2010](https://data.cityofnewyork.us/Environment/Natural-Gas-Consumption-by-ZIP-Code-2010/uedp-fegm)
- Healthcare
    - Cause of death in NYC
        - [Leading Cause of Death in NYC](https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam/data)
- ~~Real Estate~~
    - Segregation in the Buffalo housing market for one-family and two-family dwellings
        - Combine with Census Data for income, race, and section-block-lots -> Real Estate Columns: SBL, sale price, housing items
        - Predict sale prices by comparing what a dwelling costs in a higher-income area versus a lower-income area
            - Maybe need to consider specifics of housing (maybe combine with department of housing?)
- ~~Social Trends on News & Politics: Could report on issues discussed by the people vs gov officials~~
    - [Congress on Social Media](https://www.pewresearch.org/internet/dataset/congress-on-social-media-2015-2020/)
        - Congress on Social Media 2015-2020
    - [American Trends Panel Wave 90](https://www.pewresearch.org/internet/dataset/american-trends-panel-wave-90/)
        - Topics: Twitter news attitudes
    - [American Trends Panel Wave 87](https://www.pewresearch.org/politics/dataset/american-trends-panel-wave-87/)
        - Topic: Current political news and topics
- ~~COVID19: Find correlation between COVID data and social media trends~~
    - [COVID-19 Daily Counts of Cases, Hospitalizations, and Deaths](https://data.cityofnewyork.us/Health/COVID-19-Daily-Counts-of-Cases-Hospitalizations-an/rc75-m7u3)
        - Daily count of NYC residents who tested positive for SARS-CoV-2, who were hospitalized with COVID-19, and deaths among COVID-19 patients
        - Note: Dataset currently pulls from https://raw.githubusercontent.com/nychealth/coronavirus-data/master/trends/data-by-day.csv on a daily basis
    - [New York State Statewide COVID-19 Testing](https://health.data.ny.gov/Health/New-York-State-Statewide-COVID-19-Testing/xdss-u53e)
        - Information on the number of tests of individuals for COVID-19 infection performed in New York State beginning March 1, 2020
    - [American Trends Panel Wave 74](https://www.pewresearch.org/internet/dataset/american-trends-panel-wave-74/)
        - Topics: Online harassment, race relations, COVID-19
    - [American Trends Panel Wave 70](https://www.pewresearch.org/internet/dataset/american-trends-panel-wave-70/)
        - Topics: Religion in public life, social media’s role in politics and society, COVID-19 contact tracing
        
*Sources: Pew Research Center and data.gov*
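
The "Season/Date => Classes" factor listed under the utility cost topic could be derived from a month column roughly as follows. This is a minimal sketch: the notes do not say how the classes are defined, so the meteorological season boundaries in `SEASON_BY_MONTH` are an assumption, and the sample month strings are made up for illustration.

```python
import pandas as pd

# Assumed class definition: meteorological seasons (Dec-Feb = Winter, and so on).
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

# Illustrative month strings in YYYY-MM form (not taken from the dataset).
months = pd.to_datetime(
    pd.Series(["2021-01", "2021-04", "2021-07", "2021-10", "2021-12"]),
    format="%Y-%m",
)

# Map each month number to its season label.
seasons = months.dt.month.map(SEASON_BY_MONTH)
print(seasons.tolist())
```

A different split (for example heating season versus cooling season) would only require changing the dictionary.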

In [1]:
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from scipy import stats

In [2]:
elec_data = pd.read_csv('Dataset/Electric_Consumption_And_Cost__2010_-_Feb_2022_.csv')
heat_data = pd.read_csv('Dataset/Heating_Gas_Consumption_And_Cost__2010_-__Feb_2022_.csv')
#heat_data.drop(['Meter AMR','Meter Scope','Funding Source','Vendor Name','Estimated','Bill Analyzed'], axis=1)
# Drop administrative columns that are not needed for the cost analysis.
elec_data = elec_data.drop(columns=['Meter Scope','Funding Source','Vendor Name','Estimated','Bill Analyzed','Meter AMR'])
# Drop every row that still has at least one missing value.
elec_data = elec_data.dropna()
# Parse 'Revenue Month' as a date; this assumes the raw values are formatted as YYYY-MM (e.g. '2021-01').
elec_data['Revenue Month'] = pd.to_datetime(elec_data['Revenue Month'], format='%Y-%m')
elec_data.shape

(394156, 21)
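
Because the cell above removes every row with any missing value, it can be worth checking how much data that costs before committing to it. The following is a hedged sketch on a toy frame; the column names mirror the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Toy frame with invented values; only the column names echo the real dataset.
toy = pd.DataFrame({
    "Borough": ["BRONX", "BROOKLYN", None, "QUEENS"],
    "Current Charges": [100.0, None, 310.0, 95.5],
    "Rate Class": ["A", "A", "A", None],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Rows that a blanket dropna() would remove, as a count and as a share.
rows_lost = len(toy) - len(toy.dropna())
share_lost = rows_lost / len(toy)

print(missing_per_column)
print(f"dropna() would remove {rows_lost} of {len(toy)} rows ({share_lost:.0%})")
```

If a single column accounts for most of the loss, dropping or imputing that column first may preserve more rows than a blanket `dropna()`.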

In [3]:
import datetime

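# Keep only bills whose revenue month falls in calendar year 2021
# (strictly after 2020-12-31 and strictly before 2022-01-01).
gtY2020 = datetime.datetime(2020,12,31)
lsY2022 = datetime.datetime(2022,1,1)
elec_data2020s = elec_data[elec_data['Revenue Month']>gtY2020]
#elec_data2020s
# Note: despite its name, elec_data2020 holds the 2021 bills.
elec_data2020 = elec_data2020s[elec_data2020s['Revenue Month']<lsY2022]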
elec_data2020

Unnamed: 0,Development Name,Borough,Account Name,Location,TDS #,EDP,RC Code,AMP #,UMIS BILL ID,Revenue Month,...,Service End Date,# days,Meter Number,Current Charges,Rate Class,Consumption (KWH),KWH Charges,Consumption (KW),KW Charges,Other charges
314913,ADAMS,BRONX,ADAMS,BLD 05,118.0,248,B011800,NY005001180P,9852732,2021-01-01,...,01/26/2021,34.0,7223256,14204.10,GOV/NYC/068,134400,4972.8,208.00,2238.08,6993.22
314914,ADAMS,BRONX,ADAMS,BLD 05,118.0,248,B011800,NY005001180P,9853077,2021-02-01,...,02/25/2021,30.0,7223256,17486.16,GOV/NYC/068,119200,4410.4,208.00,2238.08,10837.68
314927,ADAMS,BRONX,ADAMS,BLD 06,118.0,248,B011800,NY005001180P,9745842,2021-01-01,...,01/26/2021,34.0,9985100,11376.21,GOV/NYC/068,82400,3048.8,144.00,1549.44,6777.97
314928,ADAMS,BRONX,ADAMS,BLD 06,118.0,248,B011800,NY005001180P,9853078,2021-02-01,...,02/25/2021,30.0,9985100,12514.91,GOV/NYC/068,77200,2856.4,164.00,1764.64,7893.87
314941,ADAMS,BRONX,ADAMS,BLD 07,118.0,248,B011800,NY005001180P,9745843,2021-01-01,...,01/26/2021,34.0,9983550,10225.37,GOV/NYC/068,79200,2930.4,124.00,1334.24,5960.73
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
407020,WYCKOFF GARDENS,BROOKLYN,WYCKOFF GARDENS,BLD 02,163.0,272,K016300,NY005011630P,10846474,2021-12-01,...,12/23/2021,31.0,1096666,4283.33,GOV/NYC/068,0,0.0,140.88,1515.87,2767.46
407021,WYCKOFF GARDENS,BROOKLYN,WYCKOFF GARDENS,BLD 03,163.0,272,K016300,NY005011630P,10846474,2021-12-01,...,12/23/2021,31.0,1096667,4976.54,GOV/NYC/068,0,0.0,163.68,1761.20,3215.34
407022,WYCKOFF GARDENS,BROOKLYN,WYCKOFF GARDENS,BLD 03,163.0,272,K016300,NY005011630P,10846474,2021-12-01,...,12/23/2021,31.0,8096664,10371.29,GOV/NYC/068,99200,3670.4,0.00,0.00,6700.89
407023,WYCKOFF GARDENS,BROOKLYN,WYCKOFF GARDENS,BLD 02,163.0,272,K016300,NY005011630P,10846474,2021-12-01,...,12/23/2021,31.0,8096666,8782.14,GOV/NYC/068,84000,3108.0,0.00,0.00,5674.14


In [None]:
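# catplot is a figure-level function and creates its own figure, so its size is set
# through height/aspect (in inches) rather than a separate plt.figure call.
# 'strip' replaces 'swarm' here because swarm placement is very slow on this many rows.
sns.catplot(data=elec_data2020,x='Revenue Month',y='Current Charges',hue='Borough',kind='strip',height=6,aspect=2.5)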
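
Plotting every bill as its own marker can be slow and hard to read on a frame this large. One alternative is to aggregate first and plot one value per borough per month. The sketch below shows the aggregation on a toy frame with invented values; using the median as the summary statistic is an assumption, not something the notebook prescribes.

```python
import pandas as pd

# Toy bills with invented values; column names follow the electric dataset.
toy = pd.DataFrame({
    "Revenue Month": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-01", "2021-02-01", "2021-02-01"]
    ),
    "Borough": ["BRONX", "BRONX", "BROOKLYN", "BRONX", "BROOKLYN"],
    "Current Charges": [100.0, 300.0, 50.0, 400.0, 70.0],
})

# Median charge per month and borough, reshaped to one column per borough.
monthly = (
    toy.groupby(["Revenue Month", "Borough"])["Current Charges"]
    .median()
    .unstack("Borough")
)
print(monthly)
```

Applied to `elec_data2020`, the same pattern yields at most twelve rows, and calling `monthly.plot()` on the result draws one line per borough.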