## MORE TIPS ON READING IN DATA IN PANDAS & INITIAL PROCESSING
- Some features of real-world files cause errors when Pandas tries to read in the data  
<br>
- Other features read in fine, but we want to remove or change them so that our data is cleaner and has the appropriate structure 

In [1]:
import pandas as pd
import chardet
import urllib.request # module for downloading data from URLs

In [2]:
# Added after recording this lesson
filename = 'CU_data_July2017_full.csv'
url = 'http://rfd.atmos.uiuc.edu/Atms517/week6/'+filename

# Downloads what is at the address passed in as 'url' and saves as 'filename'
urllib.request.urlretrieve(url, filename)

('CU_data_July2017_full.csv', <http.client.HTTPMessage at 0x1aa0fe99848>)

### (1)  How to view contents of a CSV file 

In [3]:
# This file is small enough that we could just open it in Excel or a text editor
# But it's good practice not to do this manually, in case our file is prohibitively large

# open() is a Python built-in, not part of Pandas
# This cell previously ran without specifying an encoding; passing it explicitly avoids decode errors
# 'with' closes the file once the block ends
with open('CU_data_July2017_full.csv', encoding='iso-8859-1') as f:
    # Prints out each line, beginning with the first
    # Iterating over the file object goes line-by-line, so this also works for very large files
    for line in f:
        print(line)

# May also be useful to read file in reverse!
#for line in reversed(open('CU_data_July2017_full.csv', encoding='iso-8859-1').readlines()):
#    print(line)

Jul-17,,,,,,,,,,,,,,,

,Midnight to Midnight,,,,,,,,8 AM Observations,,,,,,

,Temperature (F),,,,,,Degree Days,,Precipitation (inches),,,Soil Temperature (F),,Temp. (F),

Day,High,Time (CST),Low,Time (CST),Mean,Depart,Heating,Cooling,Precipitation,Snowfall,Snow Depth,4 inch,8 inch,Morning Low,Comments

1,84,3:25 PM,66,11:57 PM,75,0,0,10,0.55,0,0,73,74,66,thunderstorms

2,88,3:17 PM,61,4:48 AM,75,0,0,10,0,0,0,72,74,61,

3,89,2:07 PM,67,5:03 AM,78,3,0,13,0,0,0,75,76,67,

4,89,3:07 PM,67,4:29 AM,78,3,0,13,0,0,0,76,76,67,

5,86,11:53 AM,70,4:10 AM,78,3,0,13,0,0,0,77,77,70,

6,89,4:46 PM,70,4:29 AM,80,5,0,15,0,0,0,77,78,70,

7,92,2:14 PM,69,3:42 AM,81,6,0,16,0,0,0,77,78,69,

8,83,2:31 PM,64,11:59 PM,74,-1,0,9,0,0,0,75,77,65,

9,88,1:44 PM,59,3:15 AM,74,-1,0,9,0,0,0,74,76,59,

10,90,3:17 PM,68,9:16 AM,79,4,0,14,0.04,0,0,77,78,72,rain

11,85,5:16 PM,69,11:02 AM,77,2,0,12,0.09,0,0,,,71,"rain, thunder"

12,,2:54 PM,69,4:29 AM,,,,,0.68,0,0,75,76,69,thunderstorms

13,85,1:25 PM,74,7:27 AM,80,5,0,
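
For a very large file, it may be enough to look at only the first few lines. The following is a minimal sketch of that idea using `itertools.islice`; the helper name `preview` and the small stand-in file are assumptions made for illustration and are not part of the lesson's data.

```python
import itertools
import os
import tempfile


def preview(path, n=5, encoding="iso-8859-1"):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, encoding=encoding) as f:
        # islice stops after n lines, so a huge file is never fully loaded
        return [line.rstrip("\n") for line in itertools.islice(f, n)]


# Hypothetical stand-in file that mimics the layout of the CSV above
sample = "Jul-17,,\n,Midnight to Midnight,\nDay,High,Low\n1,84,66\n2,88,61\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="iso-8859-1"
) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

first_three = preview(tmp_path, n=3)
for line in first_three:
    print(line)

os.remove(tmp_path)  # clean up the stand-in file
```

Stripping the trailing newline before printing also avoids the double-spaced output seen above, which comes from `print` adding its own newline to each line.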

### (2)  Skipping rows near beginning of file & end of file
- In this example, we'll skip Lines 1-3: they contain useful information, but not information we want to incorporate into our data structure
<br><br>
- **Skipping rows near beginning?** Use the *skiprows* option in pd.read_csv. There are different ways to specify which rows to skip (here, the number of rows)
<br><br>
- **Skipping rows near end?** Use the *skipfooter* option in pd.read_csv: how many rows from the end would you like to skip?
     - note: you must also specify engine='python' in pd.read_csv to use this option
     - the default engine of 'c' is more efficient, but 'python' is more feature-complete and permits use of skipfooter
<br><br>
- Trying this, however:<br><br>
&nbsp;&nbsp;tdata = pd.read_csv('CU_data_July2017_full.csv',index_col='Day',**skiprows = 3,skipfooter=14,engine='python'**)
<br><br>
&nbsp;leads to:
<br><br>
&nbsp;&nbsp; *UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 60: invalid start byte*
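
To see *skiprows* and *skipfooter* in isolation, without the encoding problem getting in the way, here is a small sketch on an in-memory file. The sample text, including its two summary lines at the end, is made up for illustration and only mimics the layout of the real file.

```python
import io

import pandas as pd

# Made-up sample: 3 title lines, a header row, 3 data rows, 2 footer lines
raw = (
    "Jul-17,,\n"
    ",Midnight to Midnight,\n"
    ",Temperature (F),\n"
    "Day,High,Low\n"
    "1,84,66\n"
    "2,88,61\n"
    "3,89,67\n"
    "Monthly summary,,\n"
    "Mean,87,64.7"
)

demo = pd.read_csv(
    io.StringIO(raw),
    index_col="Day",
    skiprows=3,       # skip the 3 title lines; the next line becomes the header
    skipfooter=2,     # skip the 2 summary lines at the end
    engine="python",  # skipfooter is not supported by the default 'c' engine
)
print(demo)
```

Besides a number of rows, *skiprows* also accepts a list of row indices or a function that decides row by row.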

### (3) Determining encoding
- Degree symbols and other special characters may not be read correctly using the default encoding in Pandas (UTF-8)
<br><br>
- Solution:
  - (1) evaluate what your encoding is using "chardet" package
  - (2) pass your encoding into the *encoding* option in pd.read_csv

In [4]:
# chardet can tell you your encoding
# Read the raw bytes; 'with' closes the file afterwards
with open('CU_data_July2017_full.csv', 'rb') as f:
    edata = f.read()
result = chardet.detect(edata)
encode = result['encoding']

tdata = pd.read_csv('CU_data_July2017_full.csv', index_col='Day', skiprows=3, skipfooter=14, engine='python', encoding=encode)
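
The error from section (2) can be reproduced on a few bytes, which may help show why the *encoding* option matters. This is only a sketch: the sample bytes are made up, and the encoding is passed by hand here rather than detected with chardet (whose result is a guess and also carries a 'confidence' value).

```python
import io

import pandas as pd

# Made-up sample: a header containing a degree sign, stored as Latin-1,
# where the degree sign is the single byte 0xb0
raw_bytes = "Day,Temp (\u00b0F)\n1,84\n2,88\n".encode("iso-8859-1")

# Decoding as UTF-8 fails, because 0xb0 cannot start a UTF-8 character
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print("utf-8 failed:", err)

# Passing the matching encoding lets Pandas read the same bytes
demo = pd.read_csv(io.BytesIO(raw_bytes), encoding="iso-8859-1")
print(demo.columns.tolist())
```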

### (4) Viewing what Pandas has read in - first few lines, last few lines
 - *data.head()* and *data.tail()* show us the first and last rows of the data that we have read in
 - you can specify how many rows are shown (the default is 5)

In [5]:
print(tdata.head(15)) # show the first 15 rows (default is 5)

     High Time (CST)  Low Time (CST).1  Mean  Depart  Heating  Cooling  \
Day                                                                      
1    84.0    3:25 PM   66     11:57 PM  75.0     0.0      0.0     10.0   
2    88.0    3:17 PM   61      4:48 AM  75.0     0.0      0.0     10.0   
3    89.0    2:07 PM   67      5:03 AM  78.0     3.0      0.0     13.0   
4    89.0    3:07 PM   67      4:29 AM  78.0     3.0      0.0     13.0   
5    86.0   11:53 AM   70      4:10 AM  78.0     3.0      0.0     13.0   
6    89.0    4:46 PM   70      4:29 AM  80.0     5.0      0.0     15.0   
7    92.0    2:14 PM   69      3:42 AM  81.0     6.0      0.0     16.0   
8    83.0    2:31 PM   64     11:59 PM  74.0    -1.0      0.0      9.0   
9    88.0    1:44 PM   59      3:15 AM  74.0    -1.0      0.0      9.0   
10   90.0    3:17 PM   68      9:16 AM  79.0     4.0      0.0     14.0   
11   85.0    5:16 PM   69     11:02 AM  77.0     2.0      0.0     12.0   
12    NaN    2:54 PM   69      4:29 AM

In [6]:
print(tdata.tail())

     High Time (CST)  Low Time (CST).1  Mean  Depart  Heating  Cooling  \
Day                                                                      
27   83.0    1:38 PM   69     11:59 PM  76.0     2.0      0.0     11.0   
28   83.0    2:37 PM   65     11:59 PM  74.0     0.0      0.0      9.0   
29   79.0    1:57 PM   60      5:47 AM  70.0    -4.0      0.0      5.0   
30   84.0    2:56 PM   60      4:30 AM  72.0    -2.0      0.0      7.0   
31   85.0    1:19 PM   64      2:04 AM  75.0     1.0      0.0     10.0   

     Precipitation  Snowfall  Snow Depth  4 inch  8 inch  Morning Low Comments  
Day                                                                             
27            0.27         0           0    79.0    80.0           71     rain  
28            0.03         0           0    76.0    78.0           66     rain  
29            0.00         0           0    73.0    76.0           60      NaN  
30            0.00         0           0    73.0    76.0           60      N

### (5) Dropping some unwanted columns 
 - Use *data.drop(list, axis=1)* to drop the columns named in the list
 - Can alternatively specify which columns to read in via the *usecols* option in pd.read_csv

In [7]:
# Let's drop some columns!
# Two columns share the name 'Time (CST)', so Pandas renamed the second occurrence 'Time (CST).1'
tdata = tdata.drop(['Time (CST)', 'Time (CST).1', '4 inch', '8 inch', 'Morning Low', 'Comments'], axis=1)
print(tdata.head())

     High  Low  Mean  Depart  Heating  Cooling  Precipitation  Snowfall  \
Day                                                                       
1    84.0   66  75.0     0.0      0.0     10.0           0.55         0   
2    88.0   61  75.0     0.0      0.0     10.0           0.00         0   
3    89.0   67  78.0     3.0      0.0     13.0           0.00         0   
4    89.0   67  78.0     3.0      0.0     13.0           0.00         0   
5    86.0   70  78.0     3.0      0.0     13.0           0.00         0   

     Snow Depth  
Day              
1             0  
2             0  
3             0  
4             0  
5             0  
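
The alternative mentioned above, choosing columns at read time, could look roughly like this. The in-memory sample is a simplified, made-up stand-in for the real file.

```python
import io

import pandas as pd

# Simplified, made-up sample with one column we do not want
raw = (
    "Day,High,Low,Mean,Comments\n"
    "1,84,66,75,thunderstorms\n"
    "2,88,61,75,\n"
)

# Only the listed columns are read; index_col must be among them
wanted = ["Day", "High", "Low", "Mean"]
demo = pd.read_csv(io.StringIO(raw), usecols=wanted, index_col="Day")
print(demo)
```

Selecting columns while reading avoids loading data that would be dropped right away, which can matter for large files.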


### (6) Renaming some columns 
 - Use *data.rename(columns={old name:new name})*

In [8]:
# Change my Low and High column names to something more descriptive
tdata2 = tdata.rename(columns={"Low": "T_low", "High": "T_high"})
print(tdata2.head())

     T_high  T_low  Mean  Depart  Heating  Cooling  Precipitation  Snowfall  \
Day                                                                           
1      84.0     66  75.0     0.0      0.0     10.0           0.55         0   
2      88.0     61  75.0     0.0      0.0     10.0           0.00         0   
3      89.0     67  78.0     3.0      0.0     13.0           0.00         0   
4      89.0     67  78.0     3.0      0.0     13.0           0.00         0   
5      86.0     70  78.0     3.0      0.0     13.0           0.00         0   

     Snow Depth  
Day              
1             0  
2             0  
3             0  
4             0  
5             0  


### (7)  Working with data type
- *data.dtypes* returns the data type of each column in the DataFrame
<br><br>
- use the *dtype* argument in pd.read_csv to specify the data type of column(s) you're reading in
  - example: dtype={'T_high':np.float64,...} (needs numpy imported as np; the keys must match the column names at the time the file is read)

In [9]:
print(tdata2.dtypes) # Cool!

T_high           float64
T_low              int64
Mean             float64
Depart           float64
Heating          float64
Cooling          float64
Precipitation    float64
Snowfall           int64
Snow Depth         int64
dtype: object
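
As a small sketch of the *dtype* argument, the example below reads the same made-up sample twice, once with the types Pandas infers and once forcing one column to float. The column names are those of the sample, not a claim about how the lesson's file should be typed.

```python
import io

import numpy as np
import pandas as pd

# Made-up sample in which every value looks like an integer
raw = "Day,High,Low\n1,84,66\n2,88,61\n"

# Let Pandas infer the types
inferred = pd.read_csv(io.StringIO(raw), index_col="Day")

# Force the 'Low' column to be read as float64
typed = pd.read_csv(io.StringIO(raw), index_col="Day", dtype={"Low": np.float64})

print(inferred.dtypes)
print(typed.dtypes)
```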


### (8)  Identifying and handling NaNs
- remember, once you identify NaNs, you need to wisely decide what to do with them
<br><br> 
- can identify using *pd.isnull(data)*

In [10]:
pd.isnull(tdata)

      High    Low   Mean  Depart  Heating  Cooling  Precipitation  Snowfall  Snow Depth
Day
1    False  False  False   False    False    False          False     False       False
2    False  False  False   False    False    False          False     False       False
3    False  False  False   False    False    False          False     False       False
4    False  False  False   False    False    False          False     False       False
5    False  False  False   False    False    False          False     False       False
6    False  False  False   False    False    False          False     False       False
7    False  False  False   False    False    False          False     False       False
8    False  False  False   False    False    False          False     False       False
9    False  False  False   False    False    False          False     False       False
10   False  False  False   False    False    False          False     False       False


- number of non-NaN values in each column: *data.count()*

In [11]:
tdata.count()

High             30
Low              31
Mean             30
Depart           30
Heating          30
Cooling          30
Precipitation    31
Snowfall         31
Snow Depth       31
dtype: int64
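
A full table of True/False values is hard to scan, so it is often handy to summarize it. The sketch below, on a tiny made-up DataFrame, counts the NaNs per column and pulls out the rows that contain any.

```python
import numpy as np
import pandas as pd

# Tiny made-up DataFrame with one missing 'High' value
demo = pd.DataFrame(
    {"High": [85.0, np.nan, 85.0], "Low": [69, 69, 74]},
    index=pd.Index([11, 12, 13], name="Day"),
)

# True counts as 1, so summing the boolean table gives NaNs per column
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Rows in which at least one column is NaN
rows_with_missing = demo[demo.isnull().any(axis=1)]
print(rows_with_missing)
```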

- drop rows with NaNs: *data.dropna()*

In [12]:
# Defaults to dropping rows containing ANY missing values
# Can specify minimum number of non-NaN values that must be present to preserve a row
# containing NaNs via 'thresh' option
# Or specify to drop columns instead
tdata = tdata.dropna()
print(tdata)

     High  Low  Mean  Depart  Heating  Cooling  Precipitation  Snowfall  \
Day                                                                       
1    84.0   66  75.0     0.0      0.0     10.0           0.55         0   
2    88.0   61  75.0     0.0      0.0     10.0           0.00         0   
3    89.0   67  78.0     3.0      0.0     13.0           0.00         0   
4    89.0   67  78.0     3.0      0.0     13.0           0.00         0   
5    86.0   70  78.0     3.0      0.0     13.0           0.00         0   
6    89.0   70  80.0     5.0      0.0     15.0           0.00         0   
7    92.0   69  81.0     6.0      0.0     16.0           0.00         0   
8    83.0   64  74.0    -1.0      0.0      9.0           0.00         0   
9    88.0   59  74.0    -1.0      0.0      9.0           0.00         0   
10   90.0   68  79.0     4.0      0.0     14.0           0.04         0   
11   85.0   69  77.0     2.0      0.0     12.0           0.09         0   
13   85.0   74  80.0     

- if you instead wanted to fill in missing values (not appropriate in the current example)
 - *data.fillna*
 - can specify a fill value, a method (backfill, forward fill, etc.), a limit (the maximum number of consecutive NaNs to fill), etc.
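
The options for handling NaNs described above can be compared side by side on a tiny made-up DataFrame. This is only a sketch; the forward fill uses *data.ffill*, on the assumption of a reasonably recent Pandas version, where it is preferred over passing a method to *data.fillna*.

```python
import numpy as np
import pandas as pd

# Tiny made-up DataFrame: row 1 is entirely NaN, row 2 is partly NaN
demo = pd.DataFrame(
    {
        "High": [84.0, np.nan, np.nan, 85.0],
        "Mean": [75.0, np.nan, 77.0, 80.0],
    }
)

# Keep rows that have at least 1 non-NaN value (drops only row 1)
kept = demo.dropna(thresh=1)

# Replace every NaN with a fixed value
constant = demo.fillna(0.0)

# Forward fill, but fill at most 1 consecutive NaN per gap
forward = demo.ffill(limit=1)

print(kept)
print(constant)
print(forward)
```

With the limit of 1, only the first NaN in the 'High' gap is filled and the second stays missing.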

### There are additional special ways to read in and initially process your data for you to explore!