# Pandas
* Pandas is an open-source Python library for data analysis (it is installed separately and is not built into Python). You'll be using Pandas heavily for data manipulation and visualisation, and for preparing data for machine learning models.


* Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

* There are two main data structures in Pandas: Series and DataFrames. Data is usually stored in DataFrames, so manipulating DataFrames quickly is probably the most important skill for data analysis.

    Source: https://pandas.pydata.org/pandas-docs/stable/overview.html


## Pandas Series

* A Series is similar to a 1-D NumPy array and normally holds values of a single type (numeric, string, datetime, etc.); mixed values are stored with the generic `object` dtype. A DataFrame is simply a table in which each column is a pandas Series.

* Creating a Series from a:
    * List
    * Tuple
    * Dictionary
    * NumPy array
    * Date range
* Series indexing


In [3]:
import pandas as pd

In [4]:
# creating a Series from a list

li = [12,34,54,6,6]
s1 = pd.Series(li)
s1

0    12
1    34
2    54
3     6
4     6
dtype: int64

In [5]:
s1.index

RangeIndex(start=0, stop=5, step=1)

In [6]:
s1.values

array([12, 34, 54,  6,  6], dtype=int64)

In [7]:
s1.dtype

dtype('int64')

In [9]:
type(s1)

pandas.core.series.Series

In [10]:
# creating series using Tuple
t = (2,3,4,4.5,6.44,'s','t')
s2 = pd.Series(t)
s2

0       2
1       3
2       4
3     4.5
4    6.44
5       s
6       t
dtype: object

In [11]:
s2.dtype  # 'O' stands for the generic object dtype (mixed values)

dtype('O')

In [12]:
s2.shape

(7,)

In [16]:
# creating series using Dictionary
import numpy as np
di = {12:'Vikas',23:"Gowtham",45:"Vamsi",56:np.nan}
s3 = pd.Series(di)
s3

12      Vikas
23    Gowtham
45      Vamsi
56        NaN
dtype: object

In [17]:
s3.index

Int64Index([12, 23, 45, 56], dtype='int64')

In [18]:
s3.values

array(['Vikas', 'Gowtham', 'Vamsi', nan], dtype=object)

In [20]:
s3.index = ['s',45.56,7,6]
s3

s          Vikas
45.56    Gowtham
7          Vamsi
6            NaN
dtype: object

In [21]:
# creating series using numpy
n = np.array([1,2,3,3,4])
s4 = pd.Series(n)
s4

0    1
1    2
2    3
3    3
4    4
dtype: int32

In [23]:
# Date range: pd.date_range returns a DatetimeIndex, not a Series
s5 = pd.date_range(start="2021-01-01", end="2021-01-31")
s5

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
               '2021-01-09', '2021-01-10', '2021-01-11', '2021-01-12',
               '2021-01-13', '2021-01-14', '2021-01-15', '2021-01-16',
               '2021-01-17', '2021-01-18', '2021-01-19', '2021-01-20',
               '2021-01-21', '2021-01-22', '2021-01-23', '2021-01-24',
               '2021-01-25', '2021-01-26', '2021-01-27', '2021-01-28',
               '2021-01-29', '2021-01-30', '2021-01-31'],
              dtype='datetime64[ns]', freq='D')
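
Since `s5` above is a `DatetimeIndex`, getting an actual Series out of a date range takes one more step. A minimal sketch of two ways this might be done follows; the names `dates`, `date_series` and `daily` are illustrative, and the numbers stored in `daily` are placeholders.

```python
import pandas as pd

# One entry per day in January 2021 (a DatetimeIndex)
dates = pd.date_range(start="2021-01-01", end="2021-01-31")

# Option 1: a Series whose values are the dates themselves
date_series = pd.Series(dates)

# Option 2: a Series indexed by date (the values are placeholder numbers)
daily = pd.Series(range(len(dates)), index=dates)

print(date_series.head())
print(daily.loc[pd.Timestamp("2021-01-03")])  # value stored for 3 January
```

Option 2 is usually the more useful form for time series, because each value can then be looked up by its date.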

# Slicing

In [24]:
s3

s          Vikas
45.56    Gowtham
7          Vamsi
6            NaN
dtype: object

In [25]:
s3[7]  # label-based (explicit) indexing: 7 is an index label here, not a position

'Vamsi'

In [26]:
s3[:2]  # positional (implicit) slicing: positions start at 0 and the end position is excluded

s          Vikas
45.56    Gowtham
dtype: object

In [28]:
s3[-1:]

6    NaN
dtype: object

In [29]:
s3[1:]

45.56    Gowtham
7          Vamsi
6            NaN
dtype: object

In [30]:
# fancy indexing: pass a list of index labels

s3[['s',45.56,6]]

s          Vikas
45.56    Gowtham
6            NaN
dtype: object
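
Plain square brackets mix label lookups (`s3[7]`) with positional slices (`s3[:2]`), which is easy to misread. A minimal sketch of the same selections written with the explicit `.loc` (label) and `.iloc` (position) accessors follows. The Series `s` is rebuilt from the data shown above so the example is self-contained; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same values and labels as s3 above
s = pd.Series(["Vikas", "Gowtham", "Vamsi", np.nan], index=["s", 45.56, 7, 6])

by_label = s.loc[7]        # label-based: the element whose index label is 7
by_position = s.iloc[2]    # position-based: the third element (positions start at 0)
first_two = s.iloc[:2]     # positional slice, end position excluded
several = s.loc[["s", 6]]  # list of labels

print(by_label, by_position)
print(first_two)
print(several)
```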

In [31]:
# a single scalar value is repeated for every index label
s4 = pd.Series("TCS", index=["vamsi", "vikas", "Goutham", "Lavanya"])
s4

vamsi      TCS
vikas      TCS
Goutham    TCS
Lavanya    TCS
dtype: object

# Task

Create a pandas Series whose index runs from 20 to 35 and whose values are the squares of the index labels. For example, an index running from 1 to 4 would give:

index|value
--|--
1|1
2|4
3|9
4|16
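
One possible solution to the task, as a sketch. It assumes the range 20 to 35 is inclusive at both ends; the names `labels` and `squares` are illustrative.

```python
import pandas as pd

labels = range(20, 36)  # 20, 21, ..., 35 (assuming 35 is included)

# Each value is the square of its own index label
squares = pd.Series([label ** 2 for label in labels], index=labels)

print(squares)
```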

### Data Analysis with Pandas

A DataFrame is the most widely used data structure in data analysis. It is a table of rows and columns, where the rows have an index and the columns have meaningful names.

* Creating Pandas DataFrame  
* File I/O  (Importing CSV data files as pandas dataframes)
* Merging and Concatenating Dataframes
    * Merge multiple dataframes using common columns/keys using pd.merge()
    * Concatenate dataframes using pd.concat()

* Indexing and Selecting Data

    * Select rows from a dataframe
    * Select columns from a dataframe
    * Select subsets of dataframes 
    * Position- and label-based indexing: df.iloc and df.loc
        * You have seen some ways of selecting rows and columns from dataframes. Let's now look at two other ways of indexing, which pandas recommends because they are more explicit (and less ambiguous):
            * Position-based indexing using df.iloc
            * Label-based indexing using df.loc
* Grouping and Summarising Dataframes
    * Grouping and aggregation are some of the most frequently used operations in data analysis, especially while doing exploratory data analysis (EDA), where comparing summary statistics across groups of data is common.
    
    * Grouping analysis can be thought of as having three parts:
        1. **Splitting** the data into groups (e.g. groups of customer segments, product categories, etc.)
        2. **Applying** a function to each group (e.g. mean or total sales of each customer segment)
        3. **Combining** the results into a data structure showing the summary statistics
* Features  
* Filtering  
* Sorting  
* Statistical  
* Plotting  
* Saving
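
To make the indexing and grouping items in the outline above more concrete, here is a minimal sketch of `df.iloc`, `df.loc` and the split-apply-combine pattern. The `sales` table, its column names and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical sales data, invented for this example
sales = pd.DataFrame(
    {
        "segment": ["Consumer", "Corporate", "Consumer", "Corporate", "Home Office"],
        "sales": [100.0, 250.0, 300.0, 150.0, 80.0],
    },
    index=["a", "b", "c", "d", "e"],
)

# Position-based indexing: first two rows, first column
first_rows = sales.iloc[:2, 0]

# Label-based indexing: rows "b" through "d" (end label included), column "sales"
middle = sales.loc["b":"d", "sales"]

# Split by segment, apply mean and sum to each group, combine into one summary table
summary = sales.groupby("segment")["sales"].agg(["mean", "sum"])

print(first_rows)
print(middle)
print(summary)
```

Note the difference in slice behaviour: `iloc` excludes the end position, while `loc` includes the end label.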
    
id |col1 | col2
--|--|--
1|678|xyz
2|123|sdf
3|454|jhg
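
A table like the one above can be built as a DataFrame from a dictionary of columns. The sketch below also shows `pd.concat()` and `pd.merge()` in their simplest form; `more_rows`, `lookup` and the `col3` column are hypothetical and exist only to have something to combine with.

```python
import pandas as pd

# The table above as a DataFrame
df = pd.DataFrame(
    {"id": [1, 2, 3], "col1": [678, 123, 454], "col2": ["xyz", "sdf", "jhg"]}
)

# Hypothetical extra rows and a hypothetical lookup table, invented for illustration
more_rows = pd.DataFrame({"id": [4], "col1": [999], "col2": ["abc"]})
lookup = pd.DataFrame({"id": [1, 2, 4], "col3": [True, False, True]})

# Stack the rows; ignore_index renumbers the row index from 0
stacked = pd.concat([df, more_rows], ignore_index=True)

# Join on the shared "id" column; how="inner" keeps only ids present in both tables
merged = pd.merge(stacked, lookup, on="id", how="inner")

print(stacked)
print(merged)
```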

# Cleaning data in Python 
![download.png](download.png)

* NaN ("not a number"): a special floating-point value that pandas uses to mark missing data
* Working with duplicates and missing values
    * isnull()
    * notnull()
    * dropna()
    * fillna()
    * replace()
* Deciding which values should be treated as missing (replaced with NaN), for example placeholder values or outliers identified in the data

* Dropping duplicate data
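
The functions listed above might be combined as in the following sketch. The data is invented, and `-999.0` is assumed here to be a placeholder that the (hypothetical) data source uses for "unknown".

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value, a placeholder value and a duplicate row
df = pd.DataFrame(
    {
        "name": ["Vikas", "Gowtham", "Vamsi", "Vamsi", "Lavanya"],
        "score": [82.0, np.nan, 91.0, 91.0, -999.0],
    }
)

missing_mask = df["score"].isnull()   # True where the value is NaN
present_mask = df["score"].notnull()  # the inverse of isnull()

cleaned = df.replace(-999.0, np.nan)  # turn the placeholder into a real missing value
deduped = cleaned.drop_duplicates()   # drop rows that exactly repeat an earlier row

# Two alternative ways of handling what is still missing:
dropped = deduped.dropna()                                   # remove rows containing NaN
filled = deduped.fillna({"score": deduped["score"].mean()})  # or fill NaN with the column mean

print(dropped)
print(filled)
```

Whether to drop or fill depends on the analysis; filling with the mean is only one of several reasonable choices.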



#### Identifying and Eliminating Outliers
* Outliers are observations that are significantly different from other data points
* Outliers can adversely affect the training process of a machine learning algorithm, resulting in a loss of accuracy.
* A common approach uses the interquartile range (IQR): values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers.

     **Interquartile range (IQR) = Q3 − Q1**, where Q1 = quantile(0.25) and Q3 = quantile(0.75)
     ![boxplot](boxplot.png)
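
A minimal sketch of IQR-based outlier detection with pandas follows. The data is invented, and the 1.5 × IQR cut-off is the usual boxplot convention, assumed here rather than prescribed.

```python
import pandas as pd

# Hypothetical measurements; 95 is deliberately far from the rest
values = pd.Series([10, 12, 11, 13, 12, 95, 11, 14])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (a common convention, assumed here)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
kept = values[(values >= lower) & (values <= upper)]

print(outliers)
print(kept)
```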