## Datasets in Python
- 2D NumPy array?
  - Works with only one data type

# Pandas
- High level data manipulation tool
- Library for data analysis
- Provides high-performance, easy-to-use data structures
- Built on Numpy
- Higher level than NumPy, which makes it a common choice for data analysis work.
- Values in different columns can have different types
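
A minimal sketch of the one-type-per-array versus one-type-per-column difference noted above. The city and visitor values are invented for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype: mixing text and numbers
# coerces every element to a string.
arr = np.array([['Austin', 139], ['Dallas', 237]])
print(arr.dtype)

# A DataFrame keeps a separate dtype for each column.
df = pd.DataFrame({'city': ['Austin', 'Dallas'], 'visitors': [139, 237]})
print(df.dtypes)
```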

## Pandas DataFrames
Tabular data structure with labeled rows and columns.

Rows: labeled by a special data structure called an Index.

### Index
A tabulated list of labels that permits fast lookup and some relational operations.

Index labels in the Apple DataFrame are dates in reverse chronological order.

![dataframe](Images/dataframe.png)

## Working with DataFrame in memory

![indexes](Images/indexes.png)

Notice that AAPL.columns is also a pandas Index.

![datetimeindex](Images/datetimeindex.png)

The AAPL.index attribute in this case is a special kind of Index called a DatetimeIndex.
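
The AAPL data itself is not included in these notes, so the sketch below builds a small stand-in frame with invented prices (the name prices and all values are assumptions) to show that both the row labels and the column labels are Index objects.

```python
import pandas as pd

# Invented values standing in for the AAPL data; dates are in
# reverse chronological order, as in the screenshots.
dates = pd.to_datetime(['2015-01-05', '2015-01-02', '2014-12-31'])
prices = pd.DataFrame({'Close': [100.0, 101.5, 102.0],
                       'Volume': [1000, 1200, 900]},
                      index=dates)

print(type(prices.index))    # row labels: a DatetimeIndex
print(type(prices.columns))  # column labels: a plain Index

# Fast lookup by label
close_on_jan_2 = prices.loc[pd.Timestamp('2015-01-02'), 'Close']
print(close_on_jan_2)
```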

### To select rows from a DataFrame
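Rows of a DataFrame can be sliced with square brackets, like Python lists:

AAPL[:5] is valid

NumPy-style two-dimensional indexing such as AAPL[:5,:] is not valid on a DataFrame and raises an error.

So, square brackets offer limited functionality.

To get something similar to NumPy array indexing in pandas, use loc and iloc: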

## loc (label based)
- Selects parts of your data based on labels.
- The loc indexer is more versatile than square brackets.
- We can select rows, columns, or rows and columns at the same time.
- Subsetting becomes very similar to subsetting in NumPy.
- The only difference: loc uses row and column labels, not the positions of elements.


![loc](Images/loc.png)
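
A short sketch of label-based selection with loc. The frame, its row labels and its values are invented for illustration.

```python
import pandas as pd

# Invented data with string row labels
df = pd.DataFrame({'High': [11.0, 12.5, 13.0],
                   'Low': [9.5, 10.0, 11.5],
                   'Close': [10.5, 12.0, 12.5]},
                  index=['mon', 'tue', 'wed'])

row = df.loc['tue']                           # one row, returned as a Series
cell = df.loc['tue', 'Low']                   # one cell, returned as a scalar
block = df.loc['mon':'tue', ['High', 'Low']]  # label slices include both ends
print(block)
```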

## iloc (integer position based)
Subsetting pandas DataFrames based on integer position rather than label.

![row_access_iloc](Images/railoc1.png)

![row_access_iloc](Images/railoc2.png)

![row_and_column_access_iloc](Images/raciloc.png)

![slicing](Images/slicing.png)
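
The same kind of selection by position, as a sketch with invented data. Unlike label slices in loc, position slices in iloc exclude the end point, as in NumPy and Python lists.

```python
import pandas as pd

# Invented data with string row labels
df = pd.DataFrame({'High': [11.0, 12.5, 13.0],
                   'Low': [9.5, 10.0, 11.5],
                   'Close': [10.5, 12.0, 12.5]},
                  index=['mon', 'tue', 'wed'])

first_row = df.iloc[0]             # first row, as a Series
top_left = df.iloc[:2, :2]         # first two rows and first two columns
picked = df.iloc[[0, 2], [0, 2]]   # rows 0 and 2, columns 0 and 2
print(picked)
```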

### head() method
Another way to see the first few rows of data:
![head](Images/head.png)

### tail() method
Accessing the last 5 rows:
![tail](Images/tail.png)

### info()
Useful summary for large dataframes
![info](Images/info.png)
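
A quick sketch of the three inspection methods on an invented 100-row frame. Both head() and tail() default to 5 rows and accept a row count.

```python
import pandas as pd

# Invented data: 100 rows
df = pd.DataFrame({'n': range(100),
                   'square': [i * i for i in range(100)]})

first = df.head()    # first 5 rows by default
last3 = df.tail(3)   # last 3 rows
print(first)
print(last3)
df.info()            # prints index type, column dtypes, non-null counts
```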

## Series
- Columns of a DataFrame are themselves a specialized pandas structure called a Series.
- Extracting a single column from a DataFrame returns a Series.
- The extracted Series has its own head() method and takes its name attribute from the DataFrame column label.

### values attribute:
- Extracts the numerical entries from a Series
- Yields a NumPy array

### To select columns, use square brackets with DataFrame name

![series](Images/series.png)

Pandas Series = 1D labelled NumPy array

Pandas DataFrame = 2D labelled array whose columns are Series

![series and dataframe](Images/series-and-dataframe.png)

If, instead of a Series object, you require a DataFrame, use double square brackets.

low = AAPL[['Low']]

To get 2 columns at a time:

highlow = AAPL[['High','Low']]
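
The Series and DataFrame selections described in this section can be checked on a small frame. The prices frame below is an invented stand-in for AAPL.

```python
import numpy as np
import pandas as pd

# Invented prices standing in for the AAPL data
prices = pd.DataFrame({'High': [11.0, 12.5, 13.0],
                       'Low': [9.5, 10.0, 11.5]})

low_series = prices['Low']          # single brackets: a Series
low_frame = prices[['Low']]         # double brackets: a one-column DataFrame
highlow = prices[['High', 'Low']]   # list of labels: a DataFrame

values = low_series.values          # the underlying NumPy array
print(type(low_series).__name__, low_series.name)
print(type(low_frame).__name__, low_frame.shape)
print(isinstance(values, np.ndarray))
```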

## Building DataFrames from scratch

### DataFrames from CSV files 
Useful when working with large amounts of data.

index_col=0 tells read_csv that the first column holds the row labels (the index).
![csv_dataframes](Images/csv_dataframes.png)
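
A self-contained sketch of index_col=0. The CSV text is invented and read from an in-memory buffer (io.StringIO) instead of a file, so that it runs without any data on disk.

```python
import io
import pandas as pd

# Invented CSV content; the first column holds the row labels
csv_text = "city,visitors,signups\nAustin,139,7\nDallas,237,12\n"

users = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(users)
print(users.index)   # the 'city' column is now the index
```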

## Creating DataFrames from Dictionary

### Method 1

In [None]:
import pandas as pd
# keys of dictionary data are used as column labels
# Values = corresponding columns, in list form
data = {'weekday' : ['Sun', 'Sun', 'Mon', 'Mon'],
        'city' : [' Austin', ' Dallas', ' Austin', ' Dallas'],
        'visitors' : [139,237,326,456],
        'signups' : [7,12,3,5]}
users = pd.DataFrame(data)
print(users)

With no index specified, the row labels are integers 0 to 3 by default.

To set row labels, assign a list to the index of the DataFrame (not the dictionary): users.index = [list with labels]
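
A minimal sketch of setting row labels, either after construction or through the index argument of pd.DataFrame. The labels 'a' and 'b' are arbitrary examples.

```python
import pandas as pd

data = {'city': ['Austin', 'Dallas'], 'signups': [7, 12]}

# Option 1: assign labels after building the DataFrame
users = pd.DataFrame(data)
users.index = ['a', 'b']

# Option 2: pass the labels when building the DataFrame
users2 = pd.DataFrame(data, index=['a', 'b'])

print(users)
print(users.equals(users2))
```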

### Method 2

In [None]:
cities = [' Austin', ' Dallas', ' Austin', ' Dallas']
signups = [7,12,3,5]
visitors = [139,237,326,456]
weekdays = ['Sun', 'Sun', 'Mon', 'Mon']
list_labels = ['city', 'signups', 'visitors', 'weekday']
list_cols = [cities, signups, visitors, weekdays]  # A list of lists
zipped = list(zip(list_labels,list_cols))

In [None]:
data = dict(zipped)
users = pd.DataFrame(data)
users

## Broadcasting

In [None]:
users['fees'] = 0 #Broadcasts value to entire column
users

## Relabeling

In [None]:
users.columns

In [None]:
list_labels = ['City','Sign-ups','Visitors','Weekday', 'Fees']
users.columns = list_labels
users.columns

In [None]:
#EXERCISE

#Given:
# names, containing the country names for which data is available.
# dr, a list of booleans that tells whether people drive on the right (True) or on the left (False) in the corresponding country.
# cpc, the number of motor vehicles per 1000 people in the corresponding country.

# Pre-defined lists
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

# Import pandas as pd


# Create dictionary my_dict with three key:value pairs: my_dict
my_dict = 


# Build a DataFrame cars from my_dict: cars
cars = 

# Print cars
print(cars)

In [None]:
# Definition of row_labels
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']

# Specify row labels of cars


# Print cars again
print()

## Importing and Exporting Data in CSV

### Dataset: Sunspot observations collected from SILSO (Sunspot Index and Long-term Solar Observations)
(Source : SILSO Daily total sunspot number http://www.sidc.be/silso/infossntotdaily)

Over 70k rows

In [None]:
filepath = 'ISSN_D_tot.csv'
#read_csv function requires a string describing a filepath as input
sunspots = pd.read_csv(filepath)  #sunspots = dataframe
sunspots.info()

- We can see that the DataFrame mostly has integer or floating-point entries.
- The index of the DataFrame (the row labels) is of type RangeIndex (just integers).

In [None]:
sunspots.iloc[10:20,:]

Some problems we can notice:

- The column headers don't make sense
- Many -1 entries in one column

Reasoning:
    
1. CSV file has no column headers.

Column meanings from SILSO website:

![alternate](Images/SILSO_columns.png)

2. Missing values in column 4 are indicated by -1, so we need to handle those.
    
3. Data representation is inconvenient.

### Let's tidy this up!

In [None]:
sunspots = pd.read_csv(filepath, header=None)
# header=None prevents pandas from assuming the first line of the file gives column labels.
# Alternatively, an integer header argument gives the row number (indexed from 0) that holds the column labels; data begins on the next row.

In [None]:
sunspots.iloc[10:20,:]


Now, rows and columns are assigned integers from 0 as labels.
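
To see what header=None changes, here is a sketch on a few invented rows read from an in-memory buffer. With the default header setting, the first data row is consumed as column labels.

```python
import io
import pandas as pd

# Invented rows with no header line
raw = "2000,1,1,5\n2000,1,2,8\n2000,1,3,-1\n"

default = pd.read_csv(io.StringIO(raw))                 # first row becomes the header
no_header = pd.read_csv(io.StringIO(raw), header=None)  # all rows are data

print(len(default), len(no_header))
print(list(no_header.columns))   # integer column labels
```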

In [None]:
col_names = ['year', 'month', 'day', 'dec_date', 'sunspots', 'definite']
sunspots = pd.read_csv(filepath, header=None, names=col_names)  #names keyword is the important bit here
sunspots.iloc[10:20,:]

In [None]:
# Reading -1 entries as Not a Number or NaN, sometimes called a null value
sunspots = pd.read_csv(filepath, header=None, names=col_names, na_values='-1')
sunspots.iloc[10:20,:]

But the data still shows -1.

Looking at the CSV file, there are space characters preceding each -1 in column 4, so the string '-1' never matches.

In [None]:
sunspots = pd.read_csv(filepath, header=None, names=col_names, na_values=' -1')
sunspots.iloc[10:20,:]
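
A sketch of two ways to handle the leading space, on invented rows. The second option, skipinitialspace=True, is not used in these notes; it is shown as an alternative that strips spaces after each delimiter so the bare '-1' matches.

```python
import io
import pandas as pd

# Invented rows: the last field has a leading space, as in the sunspot file
raw = "2000,1,1, 5\n2000,1,2, -1\n2000,1,3, 8\n"
col_names = ['year', 'month', 'day', 'sunspots']

# Option 1: include the space in the missing-value marker
with_space = pd.read_csv(io.StringIO(raw), header=None, names=col_names,
                         na_values={'sunspots': [' -1']})

# Option 2: strip spaces after delimiters, then match the bare '-1'
stripped = pd.read_csv(io.StringIO(raw), header=None, names=col_names,
                       skipinitialspace=True,
                       na_values={'sunspots': ['-1']})

print(with_space['sunspots'].isna().sum())
print(stripped['sunspots'].isna().sum())
```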

### Loading year, month, day columns in a better way:
The parse_dates keyword in read_csv can combine several columns and parse them as a single datetime column.

In [None]:
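sunspots = pd.read_csv(filepath, header=None, names=col_names, na_values={'sunspots':[' -1']}, parse_dates=[[0,1,2]])
# na_values keeps the leading space found in the file, and applies only to the 'sunspots' column
# parse_dates takes a list of lists to tell read_csv which columns to combine into one date column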
sunspots.iloc[10:20,:]


In [None]:
sunspots.info()
# year_month_day has entries of type datetime64

### Giving meaningful row labels in index

In [None]:
sunspots.index = sunspots['year_month_day']
sunspots.index.name = 'date'
sunspots.info()
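
The nested-list form of parse_dates is deprecated in recent pandas releases (2.2 and later, to my knowledge), so it may warn or fail depending on the installed version. A sketch of an alternative, on invented rows: build the date column with pd.to_datetime, which accepts a DataFrame with year, month and day columns, then make it the index with set_index.

```python
import io
import pandas as pd

# Invented rows in year, month, day, value order
raw = "2000,1,1,5\n2000,1,2,8\n2000,1,3,6\n"
col_names = ['year', 'month', 'day', 'sunspots']
df = pd.read_csv(io.StringIO(raw), header=None, names=col_names)

# Combine the three columns into one datetime column
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])

# Use the dates as row labels
df = df.set_index('date')
df.info()
```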

### Trimming redundant columns

In [None]:
#list the meaningful columns and extract them
# Result = more compact dataframe with only meaningful data
cols = ['sunspots', 'definite']
sunspots = sunspots[cols]
sunspots.iloc[10:20,:]

## Writing files
- To share this new DataFrame with others
- Export the compact DataFrame to a new CSV file using the to_csv() method
- Can also export to Excel using the to_excel() method

In [None]:
out_csv = 'sunspots.csv' # csv = comma separated values
sunspots.to_csv(out_csv) 

In [None]:
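# Similarly:
# out_tsv = 'sunspots.tsv' # tsv = tab separated values
# sunspots.to_csv(out_tsv, sep='\t')  # sep='\t' is needed, otherwise commas are written

# out_xlsx = 'sunspots.xlsx'
# sunspots.to_excel(out_xlsx)  # needs an Excel writer engine such as openpyxl installed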
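
A round-trip sketch for CSV and TSV output, using an invented two-row frame and a temporary directory so nothing is left on disk. The file names are arbitrary.

```python
import os
import tempfile
import pandas as pd

# Invented frame shaped like the trimmed sunspots data
df = pd.DataFrame({'sunspots': [5.0, 8.0], 'definite': [1, 1]},
                  index=pd.to_datetime(['2000-01-01', '2000-01-02']))
df.index.name = 'date'

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'example.csv')
    tsv_path = os.path.join(tmp, 'example.tsv')

    df.to_csv(csv_path)             # comma separated
    df.to_csv(tsv_path, sep='\t')   # tab separated

    # Read both files back, restoring the date index
    from_csv = pd.read_csv(csv_path, index_col='date', parse_dates=True)
    from_tsv = pd.read_csv(tsv_path, sep='\t', index_col='date',
                           parse_dates=True)

print(from_csv)
print(from_tsv)
```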


