#  1. Acquire the Data

> "Data is the new oil"

**Ways to acquire data** (typical data sources)

- Downloaded from an internal system
- Obtained from a client or other third party
- Extracted from a web-based API
- Scraped from a website
- Extracted from a PDF file
- Gathered manually and recorded

**Data Formats**
- Flat files (e.g. csv)
- Excel files
- Database (e.g. MySQL)
- JSON
- HDFS (Hadoop)
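
Most of these formats map onto a pandas reader function. The sketch below is only illustrative: the two-row tables and the `prices` table name are made up so that the block runs without any external files, and an in-memory SQLite database stands in for a server such as MySQL. Excel files can be read in the same spirit with `pd.read_excel`, which needs an extra engine package installed.

```python
import io
import sqlite3

import pandas as pd

# Flat file: parse CSV text (passing a file path works the same way).
csv_text = "State,HighQ\nAlabama,339.06\nAlaska,288.75\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: a list of records becomes one row per record.
json_text = '[{"State": "Alabama", "HighQ": 339.06}, {"State": "Alaska", "HighQ": 288.75}]'
from_json = pd.read_json(io.StringIO(json_text))

# Database: write the table to an in-memory SQLite database, then query it back.
# "prices" is a hypothetical table name used only for this example.
conn = sqlite3.connect(":memory:")
from_csv.to_sql("prices", conn, index=False)
from_sql = pd.read_sql("SELECT State, HighQ FROM prices", conn)
conn.close()

print(from_csv.shape, from_json.shape, from_sql.shape)
```

Whatever the source, each reader returns a DataFrame, so the rest of the analysis does not depend on the original format.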

**Two Datasets**

- Price of Weed in the US
- Demographic data by US state


## 1.1 - Crowdsource the Price of Weed dataset

![Price of weed website](http://www.priceofweed.com/app/misc/images/logo.png)

The Price of Weed website - http://www.priceofweed.com/

The site crowdsources the price people pay for weed on the street. All data is self-reported.
- **Location** is auto-detected or can be chosen
- **Quality** is classified in three categories 
    - High 
    - Medium
    - Low
- **Price by weight**
    - an ounce
    - a half ounce
    - a quarter
    - an eighth
    - 10 grams
    - 5 grams
    - 1 gram
- **Strain** (though not shown in the dataset)

Prices are reported at the individual transaction level.

Here is a sample data set from the United States - http://www.priceofweed.com/prices/United-States.html

Note from the site - *Averages are corrected for outliers based on standard deviation from the mean.*
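
The site does not say exactly how its outlier correction works. As a rough sketch of the general idea, the function below drops reports that lie more than a chosen number of standard deviations from the raw mean before averaging. The cutoff of 2 standard deviations, the function name `trimmed_mean`, and the prices are all assumptions made for illustration.

```python
import numpy as np


def trimmed_mean(prices, n_std=2.0):
    """Mean of the values within n_std standard deviations of the raw mean."""
    prices = np.asarray(prices, dtype=float)
    mean, std = prices.mean(), prices.std()
    if std == 0:
        # All values are identical, so there is nothing to trim.
        return mean
    keep = np.abs(prices - mean) <= n_std * std
    return prices[keep].mean()


# Hypothetical self-reported prices for an ounce, with one implausible entry.
reports = [300, 310, 290, 305, 295, 1500]

print("raw mean:", np.mean(reports))
print("trimmed mean:", trimmed_mean(reports))
```

A single extreme report is enough to pull the raw mean far from the typical price, which is why a correction of this kind matters for self-reported data.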


## 1.2  Scrape the data

[Frank Bi](https://github.com/frankbi) from The Verge wrote a script to scrape the data daily. The daily prices are available on GitHub at https://github.com/frankbi/price-of-weed

Here is sample data from one day - 23rd July 2015 - https://github.com/frankbi/price-of-weed/blob/master/data/weedprices23072015.csv


## 1.3  Combine the data

The daily CSV files were combined into one large CSV file by YHAT.

http://blog.yhathq.com/posts/7-funny-datasets.html
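
This is not YHAT's actual code. It is a minimal sketch of how daily files could be stacked with `pd.concat`, assuming the `weedpricesDDMMYYYY.csv` naming seen in the sample file above. The file contents are made up and written to a temporary directory so the block runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as data_dir:
    # Write two tiny stand-in daily files (values are made up).
    for day, price in [("22072015", 339.06), ("23072015", 339.20)]:
        path = os.path.join(data_dir, f"weedprices{day}.csv")
        pd.DataFrame({"State": ["Alabama"], "HighQ": [price]}).to_csv(path, index=False)

    frames = []
    for path in sorted(glob.glob(os.path.join(data_dir, "weedprices*.csv"))):
        daily = pd.read_csv(path)
        # Recover the date from the DDMMYYYY part of the file name.
        stamp = os.path.basename(path)[len("weedprices"):-len(".csv")]
        daily["date"] = pd.to_datetime(stamp, format="%d%m%Y")
        frames.append(daily)

    # Stack the daily tables and renumber the rows from 0.
    combined = pd.concat(frames, ignore_index=True)

print(combined)
```

Storing the date as a column keeps the information that would otherwise be lost once the file names are gone.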


## 1.4 Key Questions / Assumptions

> Data is an abstraction of the reality.

- What assumptions have been made in this entire data collection process?
- Are we aware of the assumptions in this process?
- How do we ensure that the data is accurate and representative for the question we are trying to answer?
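
Some of these questions can be partly checked in code. The sketch below counts missing values and flags rows whose price rests on very few reports. The three rows reuse values that appear later in this section, while the threshold of 30 reports is an arbitrary assumption rather than an established rule.

```python
import numpy as np
import pandas as pd

# A three-row sample; LowQN is the number of reports behind the LowQ price.
sample = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Wyoming"],
    "LowQ": [149.49, 388.58, np.nan],
    "LowQN": [123, 26, 12],
})

# How many values are missing in each column?
missing_per_column = sample.isnull().sum()

# Which prices are based on fewer reports than the (arbitrary) threshold?
min_reports = 30
thin = sample[sample["LowQN"] < min_reports]

print(missing_per_column)
print(thin)
```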


## 1.5 Loading the Data


In [1]:
# Load the libraries
import pandas as pd
import numpy as np

In [2]:
# Load the dataset
df = pd.read_csv("D:/Self-Study/PYTHON/Contact Class/Week_6 - Pandas/data/Weed_Price.csv")

In [5]:
df.shape

(22899, 8)

In [4]:
print(df.head(10))

                  State   HighQ  HighQN    MedQ  MedQN    LowQ  LowQN  \
0               Alabama  339.06    1042  198.64    933  149.49    123   
1                Alaska  288.75     252  260.60    297  388.58     26   
2               Arizona  303.31    1941  209.35   1625  189.45    222   
3              Arkansas  361.85     576  185.62    544  125.87    112   
4            California  248.78   12096  193.56  12812  192.92    778   
5              Colorado  236.31    2161  195.29   1728  213.50    128   
6           Connecticut  347.90    1294  273.97   1316  257.36     91   
7              Delaware  373.18     347  226.25    273  199.88     34   
8  District of Columbia  352.26     433  295.67    349  213.72     39   
9               Florida  306.43    6506  220.03   5237  158.26    514   

         date  
0  2014-01-01  
1  2014-01-01  
2  2014-01-01  
3  2014-01-01  
4  2014-01-01  
5  2014-01-01  
6  2014-01-01  
7  2014-01-01  
8  2014-01-01  
9  2014-01-01  


In [4]:
df.tail()

               State   HighQ  HighQN    MedQ  MedQN  LowQ  LowQN        date
22894       Virginia  364.98    3513  293.12   3079   NaN    284  2014-12-31
22895     Washington  233.05    3337  189.92   3562   NaN    160  2014-12-31
22896  West Virginia  359.35     551  224.03    545   NaN     60  2014-12-31
22897      Wisconsin  350.52    2244  272.71   2221   NaN    167  2014-12-31
22898        Wyoming  322.27     131  351.86    197   NaN     12  2014-12-31


In [5]:
col = df.columns
col

Index(['State', 'HighQ', 'HighQN', 'MedQ', 'MedQN', 'LowQ', 'LowQN', 'date'], dtype='object')

In [6]:
df.dtypes

State      object
HighQ     float64
HighQN      int64
MedQ      float64
MedQN       int64
LowQ      float64
LowQN       int64
date       object
dtype: object

In [9]:
# Shape of the dataset - rows & columns
df.shape, df.shape[0], df.shape[1]

((22899, 8), 22899, 8)

In [10]:
df.head()

        State   HighQ  HighQN    MedQ  MedQN    LowQ  LowQN        date
0     Alabama  339.06    1042  198.64    933  149.49    123  2014-01-01
1      Alaska  288.75     252  260.60    297  388.58     26  2014-01-01
2     Arizona  303.31    1941  209.35   1625  189.45    222  2014-01-01
3    Arkansas  361.85     576  185.62    544  125.87    112  2014-01-01
4  California  248.78   12096  193.56  12812  192.92    778  2014-01-01


In [11]:
# Check for type of each variable
df.dtypes

State      object
HighQ     float64
HighQN      int64
MedQ      float64
MedQN       int64
LowQ      float64
LowQN       int64
date       object
dtype: object

In [23]:
# Let's load this again with the date column parsed as a datetime type
df = pd.read_csv("Weed_Price.csv", parse_dates=["date"])

In [24]:
# Now check the type of each column
df.dtypes

State             object
HighQ            float64
HighQN             int64
MedQ             float64
MedQN              int64
LowQ             float64
LowQN              int64
date      datetime64[ns]
dtype: object

In [80]:
# Load the dataset
df = pd.read_csv("/home/ishwarya/Downloads/Python/Weed_Price.csv")

In [81]:
# Alternatively, convert the date column after loading
df["date"] = pd.to_datetime(df["date"])

In [82]:
df.dtypes

State             object
HighQ            float64
HighQN             int64
MedQ             float64
MedQN              int64
LowQ             float64
LowQN              int64
date      datetime64[ns]
dtype: object
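
The two approaches above give the same column type. The sketch below repeats both on a small in-memory CSV (two rows copied from the dataset, so no file path is needed) and shows what the datetime type makes possible, for example extracting the month with the `.dt` accessor.

```python
import io

import pandas as pd

csv_text = "State,HighQ,date\nAlabama,339.06,2014-01-01\nAlaska,288.75,2014-01-01\n"

# Option 1: parse the date column while reading.
at_load = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Option 2: read the column as text first, then convert it.
after_load = pd.read_csv(io.StringIO(csv_text))
after_load["date"] = pd.to_datetime(after_load["date"])

print(at_load["date"].dtype, after_load["date"].dtype)

# Datetime columns expose date parts through the .dt accessor.
print(at_load["date"].dt.month.tolist())
```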

In [28]:
# Get the index of all rows
df.index

RangeIndex(start=0, stop=22899, step=1)

## 1.6 Viewing the Data

In [32]:
# Get specific rows
df[30:35]

            State   HighQ  HighQN    MedQ  MedQN    LowQ  LowQN        date
30       Oklahoma  364.57     754  228.12    664  121.60    131  2014-01-01
31         Oregon  213.49    1612  188.55   1336  175.87     61  2014-01-01
32       Maryland  380.03    1963  248.71   1546  196.59    152  2014-01-01
33  Massachusetts  354.84    2878  297.28   2467  217.89    219  2014-01-01
34       Michigan  299.36    3048  277.63   2529  227.48    230  2014-01-01


In [15]:
# Access a specific column by name
df["State"]

0                     Alabama
1                      Alaska
2                     Arizona
3                    Arkansas
4                  California
5                    Colorado
6                 Connecticut
7                    Delaware
8        District of Columbia
9                     Florida
10                    Georgia
11                     Hawaii
12                      Idaho
13                   Illinois
14                    Indiana
15                       Iowa
16                     Kansas
17                   Kentucky
18                  Louisiana
19                      Maine
20                    Montana
21                   Nebraska
22                     Nevada
23              New Hampshire
24                 New Jersey
25                 New Mexico
26                   New York
27             North Carolina
28               North Dakota
29                       Ohio
                 ...         
22869                Nebraska
22870                  Nevada
22871     

In [16]:
# Using the dot notation
df.State


0                     Alabama
1                      Alaska
2                     Arizona
3                    Arkansas
4                  California
5                    Colorado
6                 Connecticut
7                    Delaware
8        District of Columbia
9                     Florida
10                    Georgia
11                     Hawaii
12                      Idaho
13                   Illinois
14                    Indiana
15                       Iowa
16                     Kansas
17                   Kentucky
18                  Louisiana
19                      Maine
20                    Montana
21                   Nebraska
22                     Nevada
23              New Hampshire
24                 New Jersey
25                 New Mexico
26                   New York
27             North Carolina
28               North Dakota
29                       Ohio
                 ...         
22869                Nebraska
22870                  Nevada
22871     

In [4]:
df[['State', 'HighQ']]

               State   HighQ
0            Alabama  339.06
1             Alaska  288.75
2            Arizona  303.31
3           Arkansas  361.85
4         California  248.78
...              ...     ...
22894       Virginia  364.98
22895     Washington  233.05
22896  West Virginia  359.35
22897      Wisconsin  350.52


In [18]:
# Select a column by its position in the list of column names
df[df.columns[1]]

0        339.06
1        288.75
2        303.31
3        361.85
4        248.78
5        236.31
6        347.90
7        373.18
8        352.26
9        306.43
10       332.21
11       310.96
12       276.05
13       359.74
14       336.80
15       371.70
16       353.51
17       337.33
18       377.71
19       321.10
20       279.73
21       352.64
22       269.61
23       366.47
24       354.91
25       293.86
26       351.98
27       355.88
28       412.84
29       336.63
          ...  
22869    335.65
22870    266.33
22871    353.87
22872    345.36
22873    289.20
22874    343.89
22875    345.28
22876    393.94
22877    332.70
22878    354.61
22879    205.80
22880    366.04
22881    345.66
22882    288.10
22883    351.60
22884    334.70
22885    351.55
22886    363.15
22887    317.15
22888    336.98
22889    357.05
22890    348.95
22891    334.18
22892    290.42
22893    371.23
22894    364.98
22895    233.05
22896    359.35
22897    350.52
22898    322.27
Name: HighQ, Length: 228

In [19]:
# Selecting specific column and rows
df[0:5]["State"]

0       Alabama
1        Alaska
2       Arizona
3      Arkansas
4    California
Name: State, dtype: object

In [20]:
# Works both ways
df["State"][0:5]

0       Alabama
1        Alaska
2       Arizona
3      Arkansas
4    California
Name: State, dtype: object
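
Both forms above select in two steps (chained indexing). The same selection can be written as a single `loc` or `iloc` call, which is usually the safer habit once values are being assigned. The sketch uses a small stand-in frame built from the first rows shown earlier, and `df_small` is a name chosen just for this example.

```python
import pandas as pd

df_small = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"],
    "HighQ": [339.06, 288.75, 303.31, 361.85, 248.78, 236.31],
})

# Two-step (chained) selection, as in the cells above.
chained = df_small["State"][0:5]

# One-step selection by label: the end of a label slice is inclusive.
by_label = df_small.loc[0:4, "State"]

# One-step selection by position: the end of a positional slice is exclusive.
by_position = df_small.iloc[0:5, 0]

print(by_label.tolist())
```

Note the different end points: `0:4` with `loc` and `0:5` with `iloc` pick the same five rows here because the row labels happen to be 0, 1, 2, and so on.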

## 1.7 Slicing columns using pandas

In [21]:
df.index

RangeIndex(start=0, stop=22899, step=1)

In [22]:
df.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date
0,Alabama,339.06,1042,198.64,933,149.49,123,2014-01-01
1,Alaska,288.75,252,260.6,297,388.58,26,2014-01-01
2,Arizona,303.31,1941,209.35,1625,189.45,222,2014-01-01
3,Arkansas,361.85,576,185.62,544,125.87,112,2014-01-01
4,California,248.78,12096,193.56,12812,192.92,778,2014-01-01


In [101]:
df.iloc[1,0]

'Alaska'

In [103]:
df.loc[1,"State"]

'Alaska'

In [102]:
# .ix is deprecated and was removed in pandas 1.0; shown here only for comparison
df.ix[1,'State']

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


'Alaska'
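
Since `.ix` is deprecated, as the warning above shows, and was removed entirely in pandas 1.0, mixing a positional row with a labelled column has to go through `loc` or `iloc`. The sketch below shows two common ways to do that. The small frame and its string index are made up so that labels and positions are clearly different.

```python
import pandas as pd

df_small = pd.DataFrame(
    {"State": ["Alabama", "Alaska"], "HighQ": [339.06, 288.75]},
    index=["a", "b"],  # non-numeric labels, so label and position cannot be confused
)

# Row by position, column by label: turn the position into a label for loc.
via_loc = df_small.loc[df_small.index[1], "State"]

# Same selection: turn the column label into a position for iloc.
via_iloc = df_small.iloc[1, df_small.columns.get_loc("State")]

print(via_loc, via_iloc)
```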

In [86]:
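import pandas as pd
import numpy as np

# A Series whose integer labels differ from its positions and contain a duplicate (1)
s = pd.Series([45, 54, 63, 72, 81, 90, 99, 108, 117, 126],
              index=[49, 48, 47, 46, 45, 1, 1, 2, 3, 4],
              name='My_Series')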

In [87]:
s

49     45
48     54
47     63
46     72
45     81
1      90
1      99
2     108
3     117
4     126
Name: My_Series, dtype: int64

In [88]:
s.loc[1]

1    90
1    99
Name: My_Series, dtype: int64

In [89]:
s.iloc[1]

54

In [5]:
df.head(10)

                  State   HighQ  HighQN    MedQ  MedQN    LowQ  LowQN        date
0               Alabama  339.06    1042  198.64    933  149.49    123  2014-01-01
1                Alaska  288.75     252  260.60    297  388.58     26  2014-01-01
2               Arizona  303.31    1941  209.35   1625  189.45    222  2014-01-01
3              Arkansas  361.85     576  185.62    544  125.87    112  2014-01-01
4            California  248.78   12096  193.56  12812  192.92    778  2014-01-01
5              Colorado  236.31    2161  195.29   1728  213.50    128  2014-01-01
6           Connecticut  347.90    1294  273.97   1316  257.36     91  2014-01-01
7              Delaware  373.18     347  226.25    273  199.88     34  2014-01-01
8  District of Columbia  352.26     433  295.67    349  213.72     39  2014-01-01
9               Florida  306.43    6506  220.03   5237  158.26    514  2014-01-01


In [105]:
# loc can also assign new values to the rows that match a condition
df.loc[df["LowQN"] == 123, "LowQN"] = 321

In [106]:
df.head()

        State   HighQ  HighQN    MedQ  MedQN    LowQ  LowQN        date
0     Alabama  339.06    1042  198.64    933  149.49    321  2014-01-01
1      Alaska  288.75     252  260.60    297  388.58     26  2014-01-01
2     Arizona  303.31    1941  209.35   1625  189.45    222  2014-01-01
3    Arkansas  361.85     576  185.62    544  125.87    112  2014-01-01
4  California  248.78   12096  193.56  12812  192.92    778  2014-01-01
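
The condition passed to `loc` can combine several tests. The sketch below builds a boolean mask from two conditions with `&` (each side needs its own parentheses) and uses it to label suspicious rows. The `flag` column, the thresholds, and the three-row frame are assumptions made for the example.

```python
import pandas as pd

df_small = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "LowQ": [149.49, 388.58, 189.45],
    "LowQN": [123, 26, 222],
})

# Start with a default label for every row.
df_small["flag"] = "ok"

# Few reports AND an unusually high low-quality price.
mask = (df_small["LowQN"] < 100) & (df_small["LowQ"] > 300)

# Overwrite the label only where the mask is True.
df_small.loc[mask, "flag"] = "check"

print(df_small)
```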


In [33]:
# General advice:
# If you're indexing only by label, or only by integer position,
# stick with loc or iloc respectively to avoid unexpected results.
# ix used to allow mixing label and positional indexing, but it is deprecated
# (see the warning above) and was removed in pandas 1.0. To mix the two, convert
# one side first, e.g. df.loc[df.index[1], "State"] or
# df.iloc[1, df.columns.get_loc("State")].

# Exercise

1) Load the Demographics_State.csv dataset

2) Show the first five rows of the dataset

3) Select the column with the State name in the data frame