# "Edureka"

In [None]:
# Ref: https://www.youtube.com/watch?v=UB3DE5Bgfx4

## DataFrame
> **A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure.** <br/>
    Heterogeneous: not uniform in structure or composition; here, each column can hold a different data type.


### Data Types in Pandas
| Kind of data | Pandas dtype |
|---|---|
| float | float64 |
| int | int64 |
| datetime | datetime64 |
| string/Mixed | object |
| boolean | bool |
| "Diff between 2 datetimes" | timedelta |
| Finite list of text values | category |

[NB]: Features like gender, country, and codes take a small set of values that repeat from row to row. These are examples of categorical data.
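
As a minimal sketch of the categorical case, the cell below converts a text column to the `category` dtype. The sample values and the variable names (`countries`, `countries_cat`) are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample values; any column with a few repeating labels behaves the same way.
countries = pd.Series(['US', 'IN', 'US', 'DE', 'IN', 'US'])

# Convert the text column to the category dtype.
countries_cat = countries.astype('category')

print(countries_cat.dtype)                  # category
print(list(countries_cat.cat.categories))   # the distinct labels
print(list(countries_cat.cat.codes))        # the integer code stored for each row
```

Each row stores only a small integer code that points at one of the distinct labels, which is why this dtype suits repetitive columns.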

In [2]:
import pandas as pd
import numpy as np

## Pandas Series (Intro)

> One-dimensional ndarray with axis labels (including time series) <br/>
    A Series is comparable to Python's **list**, except that every element carries an index label and all elements share one dtype.

`pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)` <br/>
    data = Provide an array-like data/ list to produce a pandas series. <br/>
    index = Provide an array-like data/ list to produce an index along with the series. <br/>
    dtype = The data type to force on the Series. If omitted, it is inferred from the data. <br/>
    name = The name to give to the Series. <br/>

In [None]:
index = ['a','b','c','d','e','f','g','h','i','j']
# s = pd.Series([1,2,3,4,5,6,np.nan,8,9,10], index=index) # define custom-made index
s = pd.Series([1,2,3,4,5,6,np.nan,8,9,10], index=index, name='Test', dtype='str')
s

In [None]:
print(type(s))
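
The following sketch, with made-up values, shows how the `index`, `name` and `dtype` parameters behave, and what Pandas does when `dtype` is left out. The variable names `inferred` and `explicit` are illustrative only.

```python
import numpy as np
import pandas as pd

# dtype omitted: Pandas infers float64 here because NaN is a floating-point value.
inferred = pd.Series([1, 2, np.nan], index=['a', 'b', 'c'], name='Test')

# dtype given explicitly: the integers are stored as 64-bit floats.
explicit = pd.Series([1, 2, 3], index=['a', 'b', 'c'], dtype='float64')

print(inferred.dtype, inferred.name)
print(explicit.dtype)
print(explicit['b'])   # label-based access through the custom index
```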

## Pandas date_range()

> Generates a range of dates from the **specified starting-date**.

`pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, closed=None, **kwargs)`

**start=None** sets the first date of the date-series. <br/>
**end=None** sets the last date of the date-series. <br/>
**periods=None** sets the number of elements (rows) in the date-series. <br/>

In [None]:
# Create a DatetimeIndex (a range of dates), not a dataframe
# Generates the 31 days of January 2021
# The dates will be shown in "year-month-day" format
d = pd.date_range('20210101', periods=31)
d
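
A small sketch of how `start`, `end`, `periods` and `freq` combine. The dates mirror the January 2021 range used above; the variable names and the `'7D'` step are chosen only for illustration.

```python
import pandas as pd

# start + end: every calendar day from 1 to 31 January 2021 (both ends included).
by_end = pd.date_range(start='2021-01-01', end='2021-01-31')

# start + periods: the same 31 days, defined by a count instead of an end date.
by_periods = pd.date_range(start='2021-01-01', periods=31)

# freq changes the step between dates; '7D' means a step of seven days.
weekly = pd.date_range(start='2021-01-01', periods=4, freq='7D')

print(len(by_end), by_end.equals(by_periods))
print(weekly)
```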

### Access the Pandas Series

In [None]:
# "d" is a DatetimeIndex, not a Series or a dictionary, so it has no key/value pairs;
# only its underlying "values" can be accessed.
# d.keys

In [None]:
d.values

## DataFrame (Intro)
> Create a Dataframe from a Numpy Array, using the **date_range** above as its index

### Dataframe Structure

`pandas.DataFrame(data, index, columns, dtype, copy)`

The **data** param can be *ndarray (structured or homogeneous), Iterable, dict, or DataFrame.* <br/>
The **index** param will specify the indexing of the rows. <br/>
The **columns** param will specify the columns of the dataframe. <br/>
The **dtype** param will specify the Data type to force. Only a single dtype is allowed. If None, infer. <br/>
The **copy** param will create a copy of the input array before generating the **dataframe**, & use that copy instead of the original array as the **data**. <br/>

`np.random.randn(row, col)`

**New DataFrame (DataFrame-1)**

In [None]:
# Random number dataframe
# Create a dataframe consisting of 31 rows * 4 columns of random numbers
df = pd.DataFrame(np.random.randn(31, 4), index=d, columns=['Rand-1', 'Rand-2', 'Rand-3', 'Rand-4'], dtype=np.float64)
# df.set_index('Index')
print('Data-types of dataframe: %s' % df.dtypes)
df #show all the 31 values

**New DataFrame (DataFrame-2)** <br/>
Create a new Dataframe whose columns hold different Pandas data types

In [10]:
df_new = pd.DataFrame({
    'A': [1,2,3,4],
    'B': pd.Timestamp('20220331'),
    'C': pd.Series(12, index=range(4), dtype='int64'),
})
df_new

| | A | B | C |
|---|---|---|---|
| 0 | 1 | 2022-03-31 | 12 |
| 1 | 2 | 2022-03-31 | 12 |
| 2 | 3 | 2022-03-31 | 12 |
| 3 | 4 | 2022-03-31 | 12 |


In [6]:
df_new.dtypes

A             int64
B    datetime64[ns]
C             int64
dtype: object

## View the Data of a DataFrame
### Head & Tail

In [None]:
# Show the first 5 rows of the dataframe
df.head()

In [None]:
# Show the last 5 rows of the dataframe
df.tail()

In [None]:
# Get the first 6 random values of a specific single column ('Rand-4')
df['Rand-4'].head(n=6)

### Slice the rows of dataframe

In [None]:
# Slice the first 6 of the 31 rows (positions 0 to 5; the end position 6 is excluded)
df[0:6]

In [None]:
# View the index of the dataframe
df.index

In [None]:
# View all the columns of the dataframe
df.columns

In [None]:
# Display the data-types of all the columns of the dataframe
df.dtypes

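#### Access a specific column of a dataframe (Type-1)
> `df.columnName`   Works only if the **columnName** is a valid Python identifier (a single word without spaces or hyphens) that does not clash with a DataFrame attribute or method

#### Access a specific column of a dataframe (Type-2)
> `df['columnName']`   Works for any **columnName**, including multi-word or hyphenated names such as 'Rand-1'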

In [None]:

df['Rand-1'].head()
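
The two access styles can be compared on a tiny frame. This is only a sketch: the `prices` frame and its `price` column are hypothetical, while `Rand-1` mirrors the column name used in this notebook.

```python
import pandas as pd

# Hypothetical two-column frame: one identifier-like name, one name with a hyphen.
prices = pd.DataFrame({'price': [10.0, 12.5], 'Rand-1': [0.1, 0.2]})

by_attribute = prices.price   # works: 'price' is a valid Python identifier
by_key = prices['Rand-1']     # brackets needed: prices.Rand-1 would be read as "prices.Rand minus 1"

print(by_attribute.equals(prices['price']))
print(by_key.tolist())
```

Bracket access always works, so it is the safer default when column names come from an external file.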

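## Numpy Representation of a dataframe

> **A two-dimensional array: an outer array that contains one inner array per row**

When all the columns share a single dtype (as in `df`), `to_numpy()` can often avoid *copying the data*, so it is cheap. With mixed dtypes, Pandas has to convert every value to a common dtype first, which is more expensive.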

In [None]:
df.to_numpy()
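
To illustrate how the result depends on the column dtypes, here is a small sketch with two hypothetical frames (`same_dtype` and `mixed_dtype` are names invented for this example).

```python
import pandas as pd

# Hypothetical frames for illustration.
same_dtype = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
mixed_dtype = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

arr = same_dtype.to_numpy()
print(arr.shape, arr.dtype)           # one inner array per row, dtype preserved
print(mixed_dtype.to_numpy().dtype)   # numbers and text fall back to a common dtype
```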

## Fetch the first & last Row of a DataFrame

**New DataFrame (DataFrame-2)**

In [None]:
def hRuler():
    # Print a horizontal rule made of 100 '/' characters
    print('/' * 100)

In [None]:
# Create a dataframe from a dictionary whose values are different types of data.
# The values inside the dict are ordered as follows:
# list, pandas.Timestamp, pandas.Series (np.random.randn(4)), numpy.array(data=list), pandas.Categorical(), string
data = {
    'Series': [1,2,3,4],
    'Pandas Timestamp': pd.Timestamp('20210901'),
    # Random floats cannot be stored losslessly as 'int64', so keep them as 'float64'
    'Pandas Series': pd.Series(np.random.randn(4), index=list(range(4)), dtype='float64'),
    'Numpy Array': np.array(list(range(4)), dtype='int32'),
    'Pandas Categorical': pd.Categorical(['True', 'True', 'False', 'True']),
    'String': 'Elon',
}
df2 = pd.DataFrame(data)
hRuler()
print('%s Data Types %s' % (('*'*10), ('*'*10)))
print(df2.dtypes)
hRuler()
print('%s First 3 Rows of the DataFrame %s' % (('*'*10), ('*'*10)))
print(df2.head(n=3))
hRuler()
print('%s Last 2 Rows of the DataFrame %s' % (('*'*10), ('*'*10)))
df2.tail(n=2)

In [None]:
# Get all the random values of specific single column ('Pandas Series')
df2['Pandas Series']

In [None]:
# Display the row index of the dataframe ('df2')
df2.index

In [None]:
# Sort the rows by the 'Rand-4' column in descending order & display the last 10 rows (the 10 smallest values)
df.sort_values(by='Rand-4', ascending=False).tail(n=10)

### Dataframe Describe
> Generate descriptive statistics. It displays the **count**, mean, std (standard deviation), min, max, and the 25%, 50%, 75% percentiles of the values in each numeric column.

`DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)`
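
A minimal sketch of what `describe()` returns with its default arguments. The `scores` frame is hypothetical; it mixes one numeric and one text column to show which columns are summarised.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
scores = pd.DataFrame({'x': [1, 2, 3, 4], 'label': ['a', 'b', 'c', 'd']})

# With the default arguments, only the numeric column is summarised.
stats = scores.describe()
print(stats)
print(stats.loc['mean', 'x'])   # 2.5
```

The statistics are the rows of the result and the summarised columns are its columns, so a single figure can be read with `stats.loc[statistic, column]`.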

In [None]:
df2.describe()

### DataFrame Loc (Location)

> Access a group of rows and columns by label(s) or a boolean array.

>  Using the **loc** & **iloc**, we can practically do any data-selection-task on Pandas dataframes. <br/>
"**loc**" is label-based, which means that we have to specify the names (labels) of the rows and columns that we want to select from a dataframe.

`dataFrame.loc[row, columns]`

*Lesson Ref:  https://www.youtube.com/watch?v=sqXiwHUvuh0*

**row (a single label, a list of labels, or an x:y label range)**: Selects rows by label. A boolean condition can also be passed here to filter the rows. <br/>
**columns (a single label, a list of labels, or an x:y label range)**: Selects columns by label. <br/>
**DataFrame.at[row, column]**: Access a single value for a row/column label pair. Unlike "loc", it doesn't accept lists or ranges. <br/>
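
Before moving to the employee dataset, the label-based behaviour can be tried on a tiny self-contained frame. This is a sketch only: the `staff` frame, its values and its string row labels (`r0` to `r3`) are invented for illustration.

```python
import pandas as pd

# Hypothetical frame with string row labels, to make clear that "loc" works on labels.
staff = pd.DataFrame(
    {
        'First Name': ['Ann', 'Bob', 'Cid', 'Dee'],
        'Gender': ['Female', 'Male', 'Male', 'Female'],
        'Salary': [95000, 70000, 120000, 88000],
    },
    index=['r0', 'r1', 'r2', 'r3'],
)

# Label range for the rows (the end label 'r2' is included) plus a list of columns.
label_range = staff.loc['r0':'r2', ['First Name', 'Salary']]

# Boolean condition for the rows plus a label range for the columns.
males = staff.loc[staff['Gender'] == 'Male', 'First Name':'Gender']

# "at" returns exactly one value for one row label and one column label.
one_value = staff.at['r1', 'Salary']

print(label_range)
print(males)
print(one_value)   # 70000
```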

In [None]:
# Import a new dataframe (the dataset is taken from GeeksforGeeks)
df3 = pd.read_csv('../DataSets/employees_geeksforgeeks.csv')
df3.head()

In [None]:
# Prints the first 11 rows of the dataframe: the index holds integer labels starting from 0, & "loc" includes the end label (10).
df3.loc[0:10]

#### Multi access by label into the dataframe using "loc"

In [None]:
# Print the first row & all of its columns. Basically it's printing out all the data of a single row.
# df.loc[at, :]; here the ':' means from 0th to nth column
df3.loc[0, :] # Prints the data of 0th row

In [None]:
# Print an explicitly-mentioned (static) list of rows (0th, 3rd row) including all the columns
df3.loc[[0,3], :]

In [None]:
# Print 0th to 6th rows using list-range of the dataframe including all the columns
# NB: "loc" slices by label & includes the end label, so the row labelled 6 is returned; "iloc" would exclude the end position.
df3.loc[0:6, :]

In [None]:
# Prints the 0th to 10th row of the 'Gender' column. Thus it prints 11 rows
df3.loc[0:10, 'Gender']

In [None]:
# Prints the 0th to 5th row of the ['First Name', 'Gender', 'Start Date'] columns. Thus it prints 6 rows
df3.loc[0:5, ['First Name', 'Gender', 'Start Date']]

In [None]:
# Prints 6 rows of the DataFrame for the column range from "Gender" to "Senior Management". A label range is written without the list brackets "[]".
df3.loc[0:5, 'Gender':'Senior Management']

In [None]:
# Print the first 5 rows of all the male employees including the columns ("First Name" to "Senior Management")
df3.loc[df3['Gender'] == 'Male', 'First Name':'Senior Management'].head(n=5)

In [None]:
# Identify the "Last Login Time" of the 6th row (row label 5) in dataframe "df3".
# "at" takes exactly one row label & one column label, and returns the single value at that location.
df3.at[5, 'Last Login Time']

### DataFrame iLoc (Location)

> Purely integer-location based indexing for selection by position.

`dataFrame.iloc[row-index-val, columns-index-val]`
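
The difference between position-based and label-based slicing is easiest to see side by side. The sketch below uses a hypothetical frame (`letters`) with the default integer index, so the same numbers can be passed to both accessors.

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..4.
letters = pd.DataFrame({
    'col0': list('abcde'),
    'col1': list(range(5)),
    'col2': list(range(5, 10)),
})

by_position = letters.iloc[0:3, 0:2]          # end positions excluded: 3 rows, 2 columns
by_label = letters.loc[0:3, 'col0':'col1']    # end labels included: 4 rows, 2 columns
single_row = letters.iloc[4]                  # the row at position 4, returned as a Series

print(by_position.shape, by_label.shape)      # (3, 2) (4, 2)
print(single_row['col0'])                     # e
```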

In [None]:
# With "iloc", the end position of a slice is excluded, for both rows & columns.
# Prints the first 5 rows including the columns from "First Name" to "Salary" (positions 0 to 4)
df3.iloc[0:5, 0:5]

In [None]:
# View the row at position 4 (the 5th row, where the first-name is 'Larry') of dataframe "df3" using 'iloc'. To select a row, 'iloc' expects its integer position, not its label.
df3.iloc[4]

In [None]:
# Prints the first 5 rows of the dataframe "df3"
df3.head()

In [None]:
# Prints the first 10 rows (positions 0 to 9) of the column at position 3 (the "Last Login Time" column)
df3.iloc[:10, 3]

## Boolean Indexing (Applying conditions while viewing data from dataframe)

In [None]:
# Boolean Indexing: View the rows (with all their columns) where Salary < 90000
# NB: It's an annual-salary.
df3[df3['Salary'] < 90000]

### Multiple conditions while using the 'loc' method to view data.

**Ref Link:**
- https://www.kite.com/python/answers/how-to-select-rows-by-multiple-label-conditions-with-pandas-loc-in-python
- https://towardsdatascience.com/conditional-selection-and-assignment-with-loc-in-pandas-2a5d17c7765b

In [None]:
# Prints the first 10 rows of those whose 'Salary' is greater than 100000 & who are in 'Senior Management'.
df3.loc[(df3['Salary'] > 100000) & (df3['Senior Management'] == True)].head(n=10)
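
The same pattern works with "or" and "not". The sketch below uses a small hypothetical frame (`staff`) whose column names mirror the employee dataset; the values are made up.

```python
import pandas as pd

# Hypothetical data; only the column names follow the employee dataset.
staff = pd.DataFrame({
    'Salary': [95000, 120000, 130000, 60000],
    'Senior Management': [True, True, False, False],
})

high_paid = staff['Salary'] > 100000
senior = staff['Senior Management']

both = staff.loc[high_paid & senior]      # and: both conditions must hold
either = staff.loc[high_paid | senior]    # or: at least one condition holds
not_senior = staff.loc[~senior]           # not: inverts the boolean mask

print(both)
print(either)
print(not_senior)
```

When the conditions are written inline, each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` or `==`.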

## Handling the missing data inside a DataFrame

**Support Ref Link: https://www.youtube.com/watch?v=uDr67HBIPz8&t=1s**

> Missing data or null values in a dataset can cause a lot of **disruption** in the later stages of the data science life cycle, so it's very important to deal with them in an effective manner. Sometimes our datasets contain empty cells, which are called "null values".

### We cannot provide "null values" to our machine learning model.
#### So these null values must be handled in one of the following ways:
- Drop the rows that contain them.
- Or replace them with other values.

- Pandas primarily uses **np.nan** to represent missing data.
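
The two options listed above can be sketched as follows. The `readings` frame is hypothetical, and replacing the numeric gap with the column mean is just one possible choice of replacement value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing number and one missing text value.
readings = pd.DataFrame({
    'temp': [21.5, np.nan, 19.0],
    'city': ['Oslo', 'Rome', None],
})

# Count the null values per column before deciding how to handle them.
print(readings.isna().sum())

# Option 1: drop every row that contains at least one null value.
dropped = readings.dropna()

# Option 2: replace the null values, here with the column mean and a placeholder text.
filled = readings.fillna({'temp': readings['temp'].mean(), 'city': 'unknown'})

print(dropped)
print(filled)
```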

#### Import a new DataFrame (for handling the missing data inside a DataFrame)

> Handled in another notebook. **Notebook Link:** http://localhost:8888/notebooks/Jupyter%20Notebooks/Notebooks/Basics%20of%20Pandas/Handling%20Null%20Values.ipynb


In [None]:
# Re-index, which allows us to conform the data to new labels along a specified axis (reorder, add, or drop labels).
# Called without new labels, as here, it returns a copy of "df3"
df4 = df3.reindex()
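
To see what `reindex` does when new labels are actually passed, here is a minimal sketch on a hypothetical frame (`stock`); the labels and values are made up.

```python
import pandas as pd

# Hypothetical frame with string row labels.
stock = pd.DataFrame({'qty': [5, 7, 9]}, index=['a', 'b', 'c'])

# Reorder 'c' and 'a', drop 'b', and add the new label 'd' (filled with NaN).
reordered = stock.reindex(['c', 'a', 'd'])

print(reordered)
print(stock)   # the original frame is left unchanged
```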

## Numpy Array

`numpy.array(object, dtype=None, *, copy=True, order='K', subok=False, ndmin=0, like=None)`