## Data ingestion & inspection

### Building DataFrames from Scratch


In [5]:
import numpy as np
import pandas as pd


In [6]:
# Keys of the dictionary are used as column labels
data = {'Weekday':['Sunday','Monday','Tuesday','Wednesday'],'City':['Chicago','New York','Los Angeles','Phoenix']}
indexing = ['S','M','T','W']
users = pd.DataFrame(data,index=indexing)

In [7]:
print(users)

     Weekday         City
S     Sunday      Chicago
M     Monday     New York
T    Tuesday  Los Angeles
W  Wednesday      Phoenix


In [11]:
cities = [chr(x) for x in range(98, 104)]
sign = [chr(x) for x in range(107, 113)]
list_label_for_columns = ['Column One','Column Two']
data_for_columns = [cities,sign]
print(data_for_columns)
zipped = list(zip(list_label_for_columns,data_for_columns))
print(zipped)
data = dict(zipped)
print(data)
users = pd.DataFrame(data)
users.info()
print('\n')
print(users)
list_labels = ['New One','New Two']    # <--- 
users.columns = list_labels            # Using a list to assign to the columns attribute of the DataFrame
print('\n')
print(users)

# data = pd.read_csv('csv_name.csv', index_col='date', parse_dates=True)
# index_col='date' uses the 'date' column as the row index; parse_dates=True parses that index as datetimes


[['b', 'c', 'd', 'e', 'f', 'g'], ['k', 'l', 'm', 'n', 'o', 'p']]
[('Column One', ['b', 'c', 'd', 'e', 'f', 'g']), ('Column Two', ['k', 'l', 'm', 'n', 'o', 'p'])]
{'Column One': ['b', 'c', 'd', 'e', 'f', 'g'], 'Column Two': ['k', 'l', 'm', 'n', 'o', 'p']}
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
Column One    6 non-null object
Column Two    6 non-null object
dtypes: object(2)
memory usage: 88.0+ bytes


  Column One Column Two
0          b          k
1          c          l
2          d          m
3          e          n
4          f          o
5          g          p


  New One New Two
0       b       k
1       c       l
2       d       m
3       e       n
4       f       o
5       g       p


## Broadcasting

In [11]:
users['fees'] = 0 # Broadcasts to entire column, used to make new columns on the fly
print(users)

  New One New Two  fees
0       b       k     0
1       c       l     0
2       d       m     0
3       e       n     0
4       f       o     0
5       g       p     0


In [12]:
heights = [67.00,68.1,69.1,65.4,55.8]
data = {'heights':heights,'sex':'M'}  # Broadcast the sex:'M' here 
results = pd.DataFrame(data)
print(results)

   heights sex
0     67.0   M
1     68.1   M
2     69.1   M
3     65.4   M
4     55.8   M


### Importing and Exporting Data

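CSV files do not always contain column headers. When the header row is missing, pass header=None to pd.read_csv() and supply labels with names=[...]

df.to_csv()    for exporting a DataFrame to a CSV file
df.to_excel()  for exporting a DataFrame to an Excel file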
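
A minimal sketch of a CSV round trip, assuming an in-memory buffer stands in for a real file. The data, the `raw_csv` text, and the column names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV text with no header row (illustrative data only)
raw_csv = "Sunday,Chicago\nMonday,New York\nTuesday,Los Angeles\n"

# header=None tells pandas the first line is data; names supplies the labels
users = pd.read_csv(io.StringIO(raw_csv), header=None, names=['Weekday', 'City'])

# Export to CSV text; index=False leaves out the row index
buffer = io.StringIO()
users.to_csv(buffer, index=False)
exported = buffer.getvalue()

# Reading the exported text back should give an equal DataFrame
round_trip = pd.read_csv(io.StringIO(exported))
print(round_trip)
```

Passing a file path instead of the buffer writes to disk. `to_excel()` is called the same way but needs an Excel engine such as openpyxl installed.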

### Plotting with pandas and pyplot

import matplotlib.pyplot as plt
df.plot(color='red')
plt.title('Whatever you want the title to be')
plt.xlabel(' x-axis label')
plt.ylabel(' y-axis label')
plt.show()


## Exploratory data analysis

df['column'].count() returns the number of non-null values in the column
df[['column 1','column 2']].count() returns a Series with the non-null count of each column
df['column'].mean()
df['column'].std()
df['column'].median()
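
A small sketch of these summary methods on a DataFrame with missing values. The column names and numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value in each column
df = pd.DataFrame({
    'height': [67.0, 68.0, np.nan, 65.0],
    'weight': [150.0, np.nan, 160.0, 170.0],
})

n_heights = df['height'].count()            # non-null values only
counts = df[['height', 'weight']].count()   # Series: one count per column
mean_height = df['height'].mean()           # NaN is skipped by default
median_height = df['height'].median()
std_height = df['height'].std()             # sample standard deviation (ddof=1)

print(counts)
print(mean_height, median_height, std_height)
```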


## Time series in Pandas

### Using pandas to read datetime objects

* read_csv() function
    specify parse_dates=True   parses the index as datetimes; ISO 8601 format (yyyy-mm-dd hh:mm:ss) parses reliably
            index_col='Date'   indexes the rows with datetimes
      ex.
          df1 = pd.read_csv(filename,parse_dates=['Date'])
          df2 = pd.read_csv(filename,index_col='Date',parse_dates = True)
          
* Pandas supports partial string selection ex. df.loc['Feb 25']
              range selection with slicing     df.loc['Feb 25':'Feb 28']
              
* pd.to_datetime(['2015-2-11 20:00','2015-2-11 21:00'....],format='%Y-%m-%d %H:%M')
 
  df.reindex(evening_2_11)
  filling missing data    df.reindex(evening_2_11,method='ffill') forward fill, or method='bfill' for backward fill
  
  
  ex.   df2 = df1.reindex(df0.index)
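
The points in this list can be sketched together as follows. The CSV text and the `Units` column are hypothetical, and the example uses full ISO-formatted strings for selection to keep the dates unambiguous.

```python
import io

import pandas as pd

# Hypothetical sales log (values invented for illustration)
csv_text = """Date,Units
2015-02-11 20:00:00,3
2015-02-11 22:00:00,5
2015-02-25 10:00:00,7
2015-02-27 11:00:00,2
2015-03-01 08:00:00,9
"""

# Parse the 'Date' column and use it as the row index
df = pd.read_csv(io.StringIO(csv_text), index_col='Date', parse_dates=True)

# Partial string selection on the DatetimeIndex
one_day = df.loc['2015-02-11']                   # every row on 11 Feb
date_range = df.loc['2015-02-25':'2015-02-28']   # slice includes both endpoints

# Reindex onto three evening timestamps; 21:00 is absent from df
evening_2_11 = pd.to_datetime(
    ['2015-02-11 20:00', '2015-02-11 21:00', '2015-02-11 22:00'],
    format='%Y-%m-%d %H:%M')
with_gaps = df.reindex(evening_2_11)                  # 21:00 row is NaN
forward = df.reindex(evening_2_11, method='ffill')    # 21:00 copies 20:00
backward = df.reindex(evening_2_11, method='bfill')   # 21:00 copies 22:00
```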
  

### Resampling

Statistical methods over different time intervals. When resampling, we use method chaining:
    mean(), sum(), count(), etc.
    
Downsampling
    reduce datetime rows to slower frequency
Upsampling
    increase datetime rows to a faster frequency (the opposite of downsampling)
   
    daily_mean = sales.resample('D').mean()   Where 'D' means daily

    sales.resample('D').sum().max().ffill()   Multiple chained

    resampling frequencies / can be prefixed with a numerical value, e.g. resample('2W') for every two weeks
        'min','T'  minute
        'H'        hour
        'D'        day
        'B'        business day
        'W'        week
        'M'        month
        'Q'        quarter
        'A'        annual, year
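
A short sketch of downsampling and upsampling on made-up daily data. It sticks to the 'D', 'W' and 'h' aliases because pandas 2.2 and later deprecate some of the aliases listed above ('T', 'H', 'M', 'Q', 'A') in favor of 'min', 'h', 'ME', 'QE' and 'YE'.

```python
import pandas as pd

# Illustrative daily sales for two weeks (values invented)
days = pd.date_range('2015-02-02', periods=14, freq='D')   # starts on a Monday
sales = pd.Series(range(1, 15), index=days, dtype='float64')

# Downsampling: daily -> weekly, aggregating each bin
weekly_sum = sales.resample('W').sum()
weekly_mean = sales.resample('W').mean()

# A numeric prefix widens the bin: one bin per two days
two_day_sum = sales.resample('2D').sum()

# Upsampling: daily -> every 12 hours; the new rows are filled forward
first_two_days = sales.iloc[:2]
half_daily = first_two_days.resample('12h').ffill()

print(weekly_sum)
print(half_daily)
```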
        
    
  

### Manipulating pandas time series

String methods
sales['Company'].str.upper()

Substring searching -- str.contains('ware') returns bool 
                       str.contains('ware').sum()
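
A quick sketch of the string methods above, assuming a `Company` column of plain strings. The company names are invented.

```python
import pandas as pd

# Illustrative company names
sales = pd.DataFrame(
    {'Company': ['Hooli', 'Mediacore', 'Acme Software', 'Streeplex Hardware']}
)

upper = sales['Company'].str.upper()
has_ware = sales['Company'].str.contains('ware')   # boolean Series
n_ware = has_ware.sum()                            # each True counts as 1

print(upper)
print(n_ware)
```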

Datetime methods  --   sales['Date'].dt.hour                        Can be chained
             central = sales['Date'].dt.tz_localize('US/Central')
                       central.dt.tz_convert('US/Eastern')
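
A sketch of the datetime accessor on a `Date` column, assuming the timestamps start out timezone-naive and that the time zone database is available. The two timestamps are made up.

```python
import pandas as pd

# Illustrative naive timestamps
sales = pd.DataFrame({
    'Date': pd.to_datetime(['2015-02-11 20:00', '2015-02-11 21:30']),
})

hours = sales['Date'].dt.hour

# tz_localize attaches a time zone; tz_convert shifts to another one
central = sales['Date'].dt.tz_localize('US/Central')
eastern = central.dt.tz_convert('US/Eastern')

print(hours)
print(eastern)
```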
                       
Interpolate missing data   -- population.resample('A').first().interpolate('linear')  fills the NaN values by linear interpolation
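
The same resample-then-interpolate chain, sketched on daily bins instead of annual ones because the annual alias differs across pandas versions. The `population` values are invented.

```python
import pandas as pd

# Two illustrative readings four days apart
population = pd.Series(
    [10.0, 20.0],
    index=pd.to_datetime(['2015-02-01', '2015-02-05']),
)

# Upsample to daily; the days in between have no data and become NaN
daily = population.resample('D').first()

# Fill the NaN values along a straight line between known points
filled = daily.interpolate('linear')

print(filled)
```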

### Visualizing pandas time series
    Line types
    Plot types
    Subplots
Plot style format string
    color(k:black)
    marker(.:dot)
    line type(-:solid)
    
subplots=True can be used when looking at different columns that need separation because they overlap or have very different ranges
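
A rough sketch of a style format string and `subplots=True`, assuming matplotlib is installed. The Agg backend is selected only so the example runs without a display, and the column names and values are hypothetical.

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend, an assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd

days = pd.date_range('2015-02-01', periods=5, freq='D')
df = pd.DataFrame(
    {'temperature': [30.0, 32.0, 31.0, 35.0, 33.0],
     'pressure': [1010.0, 1008.0, 1012.0, 1009.0, 1011.0]},
    index=days,
)

# 'k.-' = black (k), dot markers (.), solid line (-)
ax = df['temperature'].plot(style='k.-')
ax.set_title('Temperature')

# subplots=True draws each column on its own axes, which helps here
# because the two columns have very different ranges
axes = df.plot(subplots=True)

plt.close('all')   # in a notebook, call plt.show() instead
```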

