# Introduction to Pandas

## What is Data Analysis?

<b>Data Analysis</b> is the process of collecting and organizing raw data, then studying it to reveal hidden patterns or trends, so that we can deduce useful information that helps us make informed decisions. Sometimes the data is also manipulated to deal with abnormalities in it.

<b>Data Analysis and Pandas</b>
- <b>Pandas </b> is an open source Python library
- <b>Pandas </b> is a tool used by Data Scientists to do the following:
    - Read
    - Write
    - Manipulate
    - Analyze
    
## Why Pandas?
- It eases the exploration and manipulation of large datasets in an efficient manner.
- We can use it to analyze large volumes of data with ease. A large volume could mean millions of rows/records.

## What makes Pandas a popular tool?
- It's easy to read and learn.
- It is fast and powerful, since much of its core is built on NumPy.
- It integrates well with visualization libraries such as Matplotlib.

## Series and DataFrame
<b>Note:</b> There are two fundamental data structures or objects at the core of the <b>Pandas</b> library. These are:
1. Series
2. DataFrame

### Explain
<b>1. Series: </b>
 - Series is a one-dimensional labeled array. 
 - It can hold data of any type. 
 - It is like a column in a given table.

Note: Every element in a Pandas Series is labeled by an index. By default, the index starts from 0.

### Syntax
> series_name = pandas.Series(list_of_values)

<b>Example</b>
> data = pd.Series([0.25, 0.5, 0.75, 1.0])

<b>Explicit indexing</b>
> data_2 = pd.Series(arr_1, index=[1, 2, 3, 4])

In [34]:
# To create a pandas Series we can use a Python list or a NumPy array
import pandas as pd

prices = [23.00, 50.00, 80.00, 45.00, 12.00]

price_series = pd.Series(prices)

print(type(prices))

print(type(price_series))

print(price_series)

<class 'list'>
<class 'pandas.core.series.Series'>
0    23.0
1    50.0
2    80.0
3    45.0
4    12.0
dtype: float64


In [35]:
price_series[2]

80.0

In [36]:
import numpy as np

np_arr = np.array([i for i in range(1, 10, 2)])
print(np_arr.shape)
# print([i for i in range(1, 10)])

# Create a pandas Series from a NumPy array
# and explicitly state the index of the Series
np_series = pd.Series(np_arr, index=[1, 2, "SPa", 4, "nine"])
np_series

(5,)


1       1
2       3
SPa     5
4       7
nine    9
dtype: int64

In [37]:
np_series.index

Index([1, 2, 'SPa', 4, 'nine'], dtype='object')

In [38]:
np_series.values

array([1, 3, 5, 7, 9])

In [39]:
np_series["nine"]

9

In [40]:
np_series["SPa"]

5

In [41]:
np_series[1]

1

In [42]:
# Exercise:
# Use a Python list and a dictionary to create pandas Series.
# Explicitly state the index of the pandas Series created from the list.

<b>2. DataFrame: </b>

  - A DataFrame is a two-dimensional table.
  - It is made up of a collection of Series.
  - It is structured with labeled axes (rows and columns).
  
 Note: <b>each column</b> in the <b>DataFrame</b> is a <b>Series</b> of Pandas.
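
The relationship between a DataFrame and its columns can be sketched as below. This is only a minimal illustration: the `items` dictionary and its values are made up for the example and are not part of any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
items = {
    "product": ["pen", "book", "bag"],
    "price": [23.0, 50.0, 80.0],
}

# Each dictionary key becomes a column, each list becomes that column's values
items_df = pd.DataFrame(items)

# Selecting a single column returns a Series
price_column = items_df["price"]

print(type(items_df))
print(type(price_column))
print(items_df.shape)  # (number of rows, number of columns)
```

Building the DataFrame from a dictionary is just one convenient option; the same check works on a DataFrame loaded from a file.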
 
 ### Outline:
 - <b>Read</b> CSV file
 - <b>Write</b> DataFrame to CSV file
 - <b>Look into</b> the data that was read
 - <b>Indexing</b>
   > - <b>iloc:</b> integer position based selection
   > - <b>loc:</b> label based selection
   
   > Note: Conceptually <b>iloc</b> and <b>loc</b> are similar. The difference is that <b>iloc</b> selects rows and columns by their integer position (0, 1, 2, ...), while <b>loc</b> selects them by their labels. When slicing, <b>iloc</b> excludes the end position while <b>loc</b> includes the end label.
 - <b>Summary</b>
 - <b>Aggregation Functions</b>
 - <b>Conditional Selection</b>
 - <b>Sorting</b>
   > - <b>ascending</b>
   > - <b>descending</b>
   
   > Note: sorting is done using a column name of the DataFrame
 - Column | Row <b>Renaming</b>
 - <b>Missing Data</b>
   > - <b>Checking</b>
   > - <b>Filling</b>
   > - <b>Dropping</b>
 - <a href="https://www.google.com/?q=Pandas">Read More About <b>Pandas</b></a> 


In [45]:
import pandas as pd
import os

In [46]:
print(dir(os))

['CLD_CONTINUED', 'CLD_DUMPED', 'CLD_EXITED', 'CLD_KILLED', 'CLD_STOPPED', 'CLD_TRAPPED', 'DirEntry', 'EX_CANTCREAT', 'EX_CONFIG', 'EX_DATAERR', 'EX_IOERR', 'EX_NOHOST', 'EX_NOINPUT', 'EX_NOPERM', 'EX_NOUSER', 'EX_OK', 'EX_OSERR', 'EX_OSFILE', 'EX_PROTOCOL', 'EX_SOFTWARE', 'EX_TEMPFAIL', 'EX_UNAVAILABLE', 'EX_USAGE', 'F_LOCK', 'F_OK', 'F_TEST', 'F_TLOCK', 'F_ULOCK', 'GenericAlias', 'Mapping', 'MutableMapping', 'NGROUPS_MAX', 'O_ACCMODE', 'O_APPEND', 'O_ASYNC', 'O_CLOEXEC', 'O_CREAT', 'O_DIRECTORY', 'O_DSYNC', 'O_EVTONLY', 'O_EXCL', 'O_EXLOCK', 'O_FSYNC', 'O_NDELAY', 'O_NOCTTY', 'O_NOFOLLOW', 'O_NOFOLLOW_ANY', 'O_NONBLOCK', 'O_RDONLY', 'O_RDWR', 'O_SHLOCK', 'O_SYMLINK', 'O_SYNC', 'O_TRUNC', 'O_WRONLY', 'POSIX_SPAWN_CLOSE', 'POSIX_SPAWN_DUP2', 'POSIX_SPAWN_OPEN', 'PRIO_PGRP', 'PRIO_PROCESS', 'PRIO_USER', 'P_ALL', 'P_NOWAIT', 'P_NOWAITO', 'P_PGID', 'P_PID', 'P_WAIT', 'PathLike', 'RTLD_GLOBAL', 'RTLD_LAZY', 'RTLD_LOCAL', 'RTLD_NODELETE', 'RTLD_NOLOAD', 'RTLD_NOW', 'R_OK', 'SCHED_FIFO', 'SC

In [47]:
os.getcwd()

'/Users/sunday-p-afolabi_mac/Code/dataset'

In [48]:
print(dir(pd))

['ArrowDtype', 'BooleanDtype', 'Categorical', 'CategoricalDtype', 'CategoricalIndex', 'DataFrame', 'DateOffset', 'DatetimeIndex', 'DatetimeTZDtype', 'ExcelFile', 'ExcelWriter', 'Flags', 'Float32Dtype', 'Float64Dtype', 'Grouper', 'HDFStore', 'Index', 'IndexSlice', 'Int16Dtype', 'Int32Dtype', 'Int64Dtype', 'Int8Dtype', 'Interval', 'IntervalDtype', 'IntervalIndex', 'MultiIndex', 'NA', 'NaT', 'NamedAgg', 'Period', 'PeriodDtype', 'PeriodIndex', 'RangeIndex', 'Series', 'SparseDtype', 'StringDtype', 'Timedelta', 'TimedeltaIndex', 'Timestamp', 'UInt16Dtype', 'UInt32Dtype', 'UInt64Dtype', 'UInt8Dtype', '__all__', '__builtins__', '__cached__', '__doc__', '__docformat__', '__file__', '__git_version__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_built_with_meson', '_config', '_is_numpy_dev', '_libs', '_pandas_datetime_CAPI', '_pandas_parser_CAPI', '_testing', '_typing', '_version_meson', 'annotations', 'api', 'array', 'arrays', 'bdate_range', 'compat', 'conca

In [49]:
# CSV - Comma Separated Values

### Read CSV File

In [52]:
# titanic_df = pd.read_csv("/Users/sunday-p-afolabi_mac/Code/dataset/titanic_data.csv")
titanic_df = pd.read_csv("titanic_data.csv")

In [54]:
type(titanic_df)

pandas.core.frame.DataFrame
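
If the Titanic file is not at hand, the same `read_csv` call can be tried on a small in-memory CSV. The sketch below assumes a made-up two-column CSV string (`csv_text`); it only illustrates that `read_csv` also accepts a file-like object and infers the column types.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "name,age\nAda,36\nGrace,45\n"

# read_csv accepts a file path or a file-like object such as StringIO
people_df = pd.read_csv(io.StringIO(csv_text))

print(type(people_df))
print(people_df.shape)    # (rows, columns)
print(people_df.columns)  # column names come from the first line of the CSV
```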

### Write DataFrame to CSV File

In [63]:

df_example = pd.DataFrame([[1, 2, 3, 4, 5, 6], [2, 3, 4, 5, 6, 7]])
df_example

,0,1,2,3,4,5
0,1,2,3,4,5,6
1,2,3,4,5,6,7
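
Before writing a DataFrame like this one to disk, it may help to see what the `index` argument of `to_csv` changes. The sketch below writes to in-memory buffers rather than real files; the names `df_sketch`, `with_index` and `without_index` are made up for the illustration.

```python
import io

import pandas as pd

df_sketch = pd.DataFrame([[1, 2, 3], [2, 3, 4]])

# Write to in-memory buffers so that nothing is left on disk
with_index = io.StringIO()
df_sketch.to_csv(with_index)  # default: the row labels are written as a first column

without_index = io.StringIO()
df_sketch.to_csv(without_index, index=False)  # row labels are left out

print(with_index.getvalue())
print(without_index.getvalue())
```

When a file written with the default setting is read back with `read_csv`, the unnamed first column shows up as `Unnamed: 0`, which is one reason `index=False` is often used.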


In [65]:
df_example.to_csv("teach_pandas_df.csv", index=False)

### Looking into the data

In [67]:
titanic_df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [68]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


> <b>Link to the dataset:</b> https://raw.githubusercontent.com/dphi-official/Datasets/master/titanic_data.csv

Thanks to <b>AI Planet</b> for the link.

### Indexing in Pandas

> <b>Indexing in Pandas</b> means selecting particular rows and columns from a DataFrame.

There are <b>two</b> different methods for <b>indexing</b> in Pandas:
- iloc
- loc
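
The two methods can be compared side by side in a small sketch. The `scores_df` DataFrame and its row labels are invented for the example; string labels are used so that positions and labels cannot be confused.

```python
import pandas as pd

# Hypothetical data with string row labels, for illustration only
scores_df = pd.DataFrame(
    {"maths": [70, 85, 90], "physics": [65, 80, 75]},
    index=["ada", "bayo", "chi"],
)

# iloc: select by integer position (first row, second column)
print(scores_df.iloc[0, 1])

# loc: select by label (row "ada", column "physics")
print(scores_df.loc["ada", "physics"])

# Slicing: iloc excludes the end position, loc includes the end label
first_two_by_position = scores_df.iloc[0:2]
first_two_by_label = scores_df.loc["ada":"bayo"]
print(first_two_by_position)
print(first_two_by_label)
```

Both single-value lookups return the same cell here, and both slices return the first two rows.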

### Summary

> A <b>summary</b> is brief information about the whole dataset, that is, the DataFrame at hand.

#### To generate a summary of a DataFrame, use the following
- info() method
- describe() method

#### Syntax
> df_name.info() <b>OR</b> df_name.describe()
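
As a rough sketch of the difference between the two methods, the example below uses a tiny made-up DataFrame (`sample_df`) with one missing value. `info()` reports column types and non-null counts, while `describe()` returns statistics for the numeric columns by default.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
sample_df = pd.DataFrame(
    {
        "age": [22.0, 38.0, None, 35.0],
        "sex": ["male", "female", "female", "male"],
    }
)

# info() prints column names, non-null counts and dtypes
sample_df.info()

# describe() returns a DataFrame of statistics (count, mean, std, min, ...)
summary = sample_df.describe()
print(summary)
```

Note that the `count` row of `describe()` ignores the missing value, so it reports 3 for `age` rather than 4.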

### Aggregation Functions

- mean()
- median()
- sum()
- unique()
- value_counts()

<b>Note:</b>

- `To see all the unique values and their frequencies in a column, use the value_counts() method.`

- `std(): the standard deviation, which tells us how much the values vary around the mean.`
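
The functions listed above can be tried on two small Series. The values in `fares` and `ports` are made up for this sketch and are not taken from the Titanic dataset.

```python
import pandas as pd

# Hypothetical values, invented for illustration only
fares = pd.Series([7.25, 71.25, 8.0, 8.0, 53.0])
ports = pd.Series(["S", "C", "S", "S", "C"])

print(fares.mean())    # arithmetic average
print(fares.median())  # middle value once sorted
print(fares.sum())     # total
print(fares.std())     # spread of the values around the mean

print(ports.unique())        # distinct values, in order of first appearance
print(ports.value_counts())  # each distinct value with its frequency
```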


### Conditional Selection

This is handy when you need to filter entries based on a particular condition or constraint.

#### Syntax
`df_name[condition(s)]`

Note: There may be more than one condition. Wrap each condition in parentheses, then combine them:
- if all of the conditions must be True, use <b>`&`</b>
- if at least one of the conditions must be True, use <b>`|`</b>
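
A minimal sketch of single and combined conditions follows. The `passengers` DataFrame is made up for the example, although its column names mirror those of the Titanic dataset.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only
passengers = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Age": [22, 38, 26, 35],
        "Pclass": [3, 1, 3, 3],
    }
)

# One condition: keep rows where Age is greater than 30
over_30 = passengers[passengers["Age"] > 30]

# Two conditions that must both hold: note the parentheses around each one
female_third_class = passengers[
    (passengers["Sex"] == "female") & (passengers["Pclass"] == 3)
]

# Two conditions where at least one must hold
young_or_first_class = passengers[
    (passengers["Age"] < 25) | (passengers["Pclass"] == 1)
]

print(over_30)
print(female_third_class)
print(young_or_first_class)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `==` and `<`.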

### Sorting

#### Syntax
- `df_name.sort_values(by=column_name)`

> <b>Note:</b> By default, the above command produces a result sorted in ascending order.

##### For descending order use
- `df_name.sort_values(by=column_name, ascending=False)`
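
Both orderings can be sketched on a small made-up DataFrame (`prices_df` is hypothetical). Note that `sort_values` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
prices_df = pd.DataFrame(
    {"item": ["book", "pen", "bag"], "price": [50.0, 23.0, 80.0]}
)

# Ascending order is the default
ascending_df = prices_df.sort_values(by="price")

# Descending order must be requested explicitly
descending_df = prices_df.sort_values(by="price", ascending=False)

print(ascending_df)
print(descending_df)
print(prices_df)  # the original order is unchanged
```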

### Renaming

Renaming changes the name of a column or the label of a row.

#### Syntax
- <b>For column of a given DataFrame, use the following</b>

```
df_name.rename(
    columns={
        previous_column_name: new_column_name
    },
    inplace=True
)
```

- <b>For row of a given DataFrame, use the following</b>

```
df_name.rename(
    index={
        previous_row_label: new_row_label
    }
)
```
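
A concrete version of the two templates might look like the sketch below. The `tickets_df` DataFrame and the new names are made up for the example. Here `inplace=True` is left out, so each call returns a new DataFrame instead of modifying the original.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
tickets_df = pd.DataFrame({"Pclass": [3, 1], "Fare": [7.25, 71.28]})

# Rename a column: {old_name: new_name}
renamed_df = tickets_df.rename(columns={"Pclass": "passenger_class"})

# Rename row labels: {old_label: new_label}
relabeled_df = renamed_df.rename(index={0: "first", 1: "second"})

print(renamed_df.columns)
print(relabeled_df.index)
print(tickets_df.columns)  # the original DataFrame keeps its column names
```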

### Handling Missing Data

#### Syntax
- Checking whether any data is missing
  - <b>`df_name.isna()`</b> OR <b>`df_name.isnull()`</b>

#### Syntax
- Filling
  - <b>`df_name.fillna(new_value, inplace=True)`</b>
  
#### Syntax
- Dropping
  - <b>`df_name.dropna()`</b>
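
The three steps can be sketched together on a small made-up DataFrame (`cabins_df` is hypothetical, with column names borrowed from the Titanic dataset). Filling `Age` with its median is only one possible choice, assumed here for illustration.

```python
import pandas as pd

# Hypothetical data with missing values, invented for illustration only
cabins_df = pd.DataFrame(
    {
        "Age": [22.0, None, 26.0, None],
        "Cabin": ["C85", None, None, None],
    }
)

# Checking: count the missing values per column, and their share of all rows
missing_counts = cabins_df.isna().sum()
missing_share = cabins_df.isna().mean()
print(missing_counts)
print(missing_share)

# Filling: replace missing ages with the median age (an assumed strategy)
filled_df = cabins_df.fillna({"Age": cabins_df["Age"].median()})
print(filled_df)

# Dropping: remove every row that still contains a missing value
dropped_df = cabins_df.dropna()
print(dropped_df)
```

Looking at the share of missing values first helps when deciding whether filling or dropping is the more sensible option for a column.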

<b>Note: In my own opinion</b>, and depending on the volume of data we are working with: if missing values make up 10 - 20% of the data points, we can fill them. However, if missing values make up 40% or more, I advise dropping them.