## Pandas
Pandas is one of the most widely used Python libraries for **data analysis and data structures**.

**Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.**

### Key Features of Pandas
- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High performance merging and joining of data.
- Time Series functionality.


## Data Analysis
- Raw data → Information → Preparation → Feature selection → Model data
- Import data (data acquisition) → Data preparation (data cleaning, data engineering) → EDA → Model data

### Installation
The standard Python distribution does not come bundled with the Pandas module. A lightweight way to get it is to install Pandas with the popular Python package installer, pip.
- pip install pandas
- If you install the Anaconda Python distribution, Pandas is installed by default.
  Anaconda (from https://www.continuum.io) is a free Python distribution for the SciPy stack.
  
#### Pandas deals with the following three data structures

- **Series**: a one-dimensional, array-like structure with homogeneous data.
- **DataFrame**: a two-dimensional structure with heterogeneous data.
- **Panel**: a three-dimensional data structure with heterogeneous data. A panel is hard to draw, but it can be pictured as a container of DataFrames. Panel was deprecated and then removed in pandas 0.25, so current versions provide only Series and DataFrame.

These data structures are built on top of NumPy arrays.

### Data Acquisition, Data Preparation (Data Cleaning, Data Manipulation), Data Engineering
#### Raw Data - Information - Insights - Actions
- pip install pandas - import pandas

- **Let's work with Series**
1. Series (1D): rows
2. DataFrame (2D): rows and columns
3. Panel (3D)

- Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels instead of simple integer indices.

### The Pandas Series Object

- A Pandas Series is a 1‐D array of indexed data. It can be created from an array or list, as shown in the following code:

In [1]:
# import libraries
import numpy as np
import pandas as pd


In [2]:
a = (91,2,3,4,5,6)
b = ['one','two','three','four','five','six']

In [3]:
x=pd.Series(a,index=b)#dtype=int
x

one      91
two       2
three     3
four      4
five      5
six       6
dtype: int64

In [4]:
x=pd.Series(a,index=b,dtype=float)
x

one      91.0
two       2.0
three     3.0
four      4.0
five      5.0
six       6.0
dtype: float64

In [5]:
y=pd.Series(b,index=a,dtype=str)
y

91      one
2       two
3     three
4      four
5      five
6       six
dtype: object

In [6]:
type(a),type(b)

(tuple, list)

In [7]:
data_series = pd.Series([0.25,0.5,0.75,1])

In [8]:
data_series

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As the output shows, a Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes. The values are simply a familiar NumPy array:

In [9]:
data_series.values

array([0.25, 0.5 , 0.75, 1.  ])

In [10]:
data_series.index

RangeIndex(start=0, stop=4, step=1)

* As with a NumPy array, data can be accessed by the associated index using the familiar Python square‐bracket notation:

In [11]:
data_series[2]

0.75

In [12]:
data_series[1:4]

1    0.50
2    0.75
3    1.00
dtype: float64

In [13]:
data_series_with_custom_index = pd.Series([0.25,0.5,0.75,1.0],index = ['a','b','c','d'])


In [14]:
data_series_with_custom_index

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
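With a custom index, the same element can be reached either by its label or by its position. The following is a minimal sketch, assuming the same values and labels as the Series above; the variable names are illustrative only.

```python
import pandas as pd

# Same data as the custom-index Series above.
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = s.loc['b']        # label-based lookup
by_position = s.iloc[1]      # position-based lookup (0-based)
has_c = 'c' in s             # membership tests check the index labels, not the values

print(by_label, by_position, has_c)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether a key is meant as a label or as a position.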

**The Pandas Series is much more general and flexible than the 1‐D NumPy array that it emulates.**

#### Series as generalized NumPy array

- The Series object looks basically interchangeable with a 1‐D NumPy array.

- The significant difference is the presence of the index: whereas the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

- This explicit index definition gives the Series object additional capabilities.

* The index need not be an integer; it can consist of values of any desired type. For instance, we can use strings as an index:

In [15]:
a=(200,300,400,770,340)
b=('Anand',"Anmol",'Sonu',"Teju","Ansh")
x=pd.Series(a,index=b)
x

Anand    200
Anmol    300
Sonu     400
Teju     770
Ansh     340
dtype: int64

In [16]:
a=(200,300,400,770,340)
b=('Anand',"Anmol",'Sonu',"Teju","Ansh")
x=pd.Series(a,index=b,dtype=float)
x

Anand    200.0
Anmol    300.0
Sonu     400.0
Teju     770.0
Ansh     340.0
dtype: float64

In [17]:
#Series from list
s = pd.Series([1,2,3,4],index=("I","II","III","IV"),dtype=int)
s

I      1
II     2
III    3
IV     4
dtype: int32
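Like a NumPy array, a Series supports element-wise arithmetic and boolean masking, and the labels travel with the values. A small sketch, reusing the names and numbers from the cells above (the variable names `marks`, `doubled` and `above_300` are illustrative):

```python
import pandas as pd

marks = pd.Series([200, 300, 400, 770, 340],
                  index=['Anand', 'Anmol', 'Sonu', 'Teju', 'Ansh'])

doubled = marks * 2              # element-wise arithmetic, labels are kept
above_300 = marks[marks > 300]   # boolean mask, as with a NumPy array

print(doubled)
print(above_300)
```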

## Creating Series from dictionary

#### Series as specialized dictionary
* A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values.

* This typing is significant: just as the type‐specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than a Python dictionary for certain operations.

In [18]:
emp={'A':8,"B":9,"C":6}
details = pd.Series(emp)
details

A    8
B    9
C    6
dtype: int64

In [19]:
type(emp)

dict

In [20]:
type(details)

pandas.core.series.Series
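The dictionary analogy extends to several everyday operations. The sketch below, which rebuilds the small `details` Series from above, shows dictionary-style key listing, iteration over pairs, and adding an entry by assignment:

```python
import pandas as pd

details = pd.Series({'A': 8, 'B': 9, 'C': 6})

keys = list(details.keys())      # index labels, like dict.keys()
pairs = list(details.items())    # (label, value) pairs, like dict.items()
details['D'] = 7                 # assigning to a new label appends an entry

print(keys, pairs, details['D'])
```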

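**Note: By default, the dictionary values become the Series elements and the keys become the index.**

* A dictionary is a mapping type, so its keys cannot be reordered the way list or tuple elements can. Passing an explicit index to pd.Series selects and orders the entries instead.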

In [21]:
#changing order of index
age = {'ram' : 28,'bob' : 19, 'cam' : 22}
s = pd.Series(age,index=['bob','ram','cam',"anand"],dtype=int)
s 

bob      19.0
ram      28.0
cam      22.0
anand     NaN
dtype: float64

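**Note**: NaN means "Not a Number". It marks the missing entry for 'anand', and because NaN is a float, the output dtype is float64.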


* The Series‐as‐dictionary analogy can be made even more explicit by creating a Series object directly from a Python dictionary:

In [22]:
pop_dict = {'California': 38332521,
            'Texas': 26448193,
            'New York': 19651127,
            'Florida': 19552860,
            'Illinois': 12882135}
pop_dict

{'California': 38332521,
 'Texas': 26448193,
 'New York': 19651127,
 'Florida': 19552860,
 'Illinois': 12882135}

In [23]:
pop_dict['Florida']

19552860

In [24]:
pop_series=pd.Series(pop_dict)
pop_series

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [25]:
pop_series['Florida']

19552860

* By default, a Series built from a dictionary takes its index from the dictionary keys in insertion order (very old pandas versions sorted the keys instead). Typical dictionary‐style item access, as above, works from here.

* Array‐style operations such as slicing are also supported by the Series:

In [26]:
pop_series['Florida':]

Florida     19552860
Illinois    12882135
dtype: int64

In [27]:
pop_series['New York':]

New York    19651127
Florida     19552860
Illinois    12882135
dtype: int64

In [28]:
pop_series['New York':'Illinois']

New York    19651127
Florida     19552860
Illinois    12882135
dtype: int64
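Note that slicing by label includes the end label, whereas slicing by position excludes the stop position. A minimal sketch with the same population data (the variable names are illustrative):

```python
import pandas as pd

pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})

label_slice = pop.loc['Texas':'Florida']   # label slice: both endpoints included
position_slice = pop.iloc[1:3]             # position slice: stop is excluded

print(len(label_slice), len(position_slice))
```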

* Data can be a NumPy array or list, in which case the index defaults to an integer sequence:

In [29]:
pd.Series([2,4,6])

0    2
1    4
2    6
dtype: int64

* Data can be a scalar, which is repeated in order to fill the specified index:

In [30]:
list_name = ['Anand','Anil','Yash']
pd.Series('hello',index = list_name)

Anand    hello
Anil     hello
Yash     hello
dtype: object

* Data can be a dictionary, in which case the index defaults to the dictionary keys (in insertion order):

In [31]:
disc_ex = {'a':2,'b':5,'c':8}
disc_ex

{'a': 2, 'b': 5, 'c': 8}

In [32]:
s1=pd.Series(disc_ex)
s1

a    2
b    5
c    8
dtype: int64

In [33]:
type(disc_ex),type(s1)

(dict, pandas.core.series.Series)

In [34]:
pd.Series({2:'a',1:'b',3:'c',4:'d'},index=[3,2]) #customize index

3    c
2    a
dtype: object

In [35]:
d = {'a':2,'b':4,'c':6}
d

{'a': 2, 'b': 4, 'c': 6}

In [36]:
ser = pd.Series(d,index=['x','y','z'])
ser

x   NaN
y   NaN
z   NaN
dtype: float64

- **Note: when an explicit index is passed with a dictionary, values are looked up by key; index labels that are not keys of the dictionary are filled with NaN.**

In [37]:
calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])

In [38]:
myvar

day1    420
day2    380
dtype: int64

In [39]:
# Read a CSV file
pd.set_option('display.max_rows',50)
pd.set_option('display.max_columns',50)
# Raw string (r"...") so the backslashes in the Windows path are not read as escape sequences
diamond = pd.read_csv(r"F:\FSDA\live\python1\dataset111\diamonds.csv")

In [40]:
diamond

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...,...
53935,53936,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,53937,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,53938,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,53939,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


In [41]:
diamond.shape  # number of rows and number of columns

(53940, 11)
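The `Unnamed: 0` column in the frame above appears because the first column of the CSV file has no header. One possible way to handle it, sketched here on a three-row in-memory stand-in (a hypothetical sample, not the real diamonds.csv), is to tell `read_csv` to use that column as the index:

```python
import io
import pandas as pd

# A three-row stand-in for diamonds.csv (hypothetical sample, not the real file).
csv_text = """,carat,cut,price
1,0.23,Ideal,326
2,0.21,Premium,326
3,0.23,Good,327
"""

raw = pd.read_csv(io.StringIO(csv_text))                  # unnamed first column is kept
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)   # first column becomes the index

print(raw.columns.tolist())
print(clean.shape)
```

With `index_col=0` the frame has one column fewer, which also changes what `shape` reports.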

In [42]:
diamond.describe()  # summary statistics for the numeric columns

Unnamed: 0.1,Unnamed: 0,carat,depth,table,price,x,y,z
count,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
mean,26970.5,0.79794,61.749405,57.457184,3932.799722,5.731157,5.734526,3.538734
std,15571.281097,0.474011,1.432621,2.234491,3989.439738,1.121761,1.142135,0.705699
min,1.0,0.2,43.0,43.0,326.0,0.0,0.0,0.0
25%,13485.75,0.4,61.0,56.0,950.0,4.71,4.72,2.91
50%,26970.5,0.7,61.8,57.0,2401.0,5.7,5.71,3.53
75%,40455.25,1.04,62.5,59.0,5324.25,6.54,6.54,4.04
max,53940.0,5.01,79.0,95.0,18823.0,10.74,58.9,31.8


In [43]:
diamond.head()# top 5 rows

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [44]:
diamond.head(10)

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
5,6,0.24,Very Good,J,VVS2,62.8,57.0,336,3.94,3.96,2.48
6,7,0.24,Very Good,I,VVS1,62.3,57.0,336,3.95,3.98,2.47
7,8,0.26,Very Good,H,SI1,61.9,55.0,337,4.07,4.11,2.53
8,9,0.22,Fair,E,VS2,65.1,61.0,337,3.87,3.78,2.49
9,10,0.23,Very Good,H,VS1,59.4,61.0,338,4.0,4.05,2.39


In [45]:
diamond.tail(10)  # last 10 rows

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
53930,53931,0.71,Premium,E,SI1,60.5,55.0,2756,5.79,5.74,3.49
53931,53932,0.71,Premium,F,SI1,59.8,62.0,2756,5.74,5.73,3.43
53932,53933,0.7,Very Good,E,VS2,60.5,59.0,2757,5.71,5.76,3.47
53933,53934,0.7,Very Good,E,VS2,61.2,59.0,2757,5.69,5.72,3.49
53934,53935,0.72,Premium,D,SI1,62.7,59.0,2757,5.69,5.73,3.58
53935,53936,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.5
53936,53937,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,53938,0.7,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,53939,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74
53939,53940,0.75,Ideal,D,SI2,62.2,55.0,2757,5.83,5.87,3.64


In [46]:
diamond.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  53940 non-null  int64  
 1   carat       53940 non-null  float64
 2   cut         53940 non-null  object 
 3   color       53940 non-null  object 
 4   clarity     53940 non-null  object 
 5   depth       53940 non-null  float64
 6   table       53940 non-null  float64
 7   price       53940 non-null  int64  
 8   x           53940 non-null  float64
 9   y           53940 non-null  float64
 10  z           53940 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.5+ MB


In [47]:
diamond.dtypes  # data type of each column

Unnamed: 0      int64
carat         float64
cut            object
color          object
clarity        object
depth         float64
table         float64
price           int64
x             float64
y             float64
z             float64
dtype: object

In [48]:
diamond.dtypes.unique() ## distinct datatypes used

array([dtype('int64'), dtype('float64'), dtype('O')], dtype=object)

In [49]:
diamond['depth'].isnull().sum()

0

In [50]:
diamond.isnull()

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
53935,False,False,False,False,False,False,False,False,False,False,False
53936,False,False,False,False,False,False,False,False,False,False,False
53937,False,False,False,False,False,False,False,False,False,False,False
53938,False,False,False,False,False,False,False,False,False,False,False


In [51]:
diamond['cut'].unique()  # all distinct values (categories) in that column

array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)

In [52]:
diamond['cut'].value_counts() ##counts of each category inside that column

Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: cut, dtype: int64

### Obtaining the data from Excel

In [53]:
#data = pd.read_excel(r"F:\FSDA\live\python1\dataset111\diamonds.xlsx")

## DataFrame

In [54]:
import pandas as pd
# Data frame from 1D List
l=["ashi","rom","sid"]
df=pd.DataFrame(l)
df 

Unnamed: 0,0
0,ashi
1,rom
2,sid


In [55]:
# DataFrame from a 2D list
data = [['Nokia',10000],['Asus',12000],['Samsung',13000],['Apple',33000]]
d = pd.DataFrame(data,columns=['Mobile','Price'])#,index=[1,2,3,4])
d

Unnamed: 0,Mobile,Price
0,Nokia,10000
1,Asus,12000
2,Samsung,13000
3,Apple,33000


### Create a DataFrame containing the marks of four students in Chem, Phy and Maths, with the roll number as the index

**Note: if no index or columns are passed, each defaults to range(n), where n is the length of that axis.**

In [56]:
import pandas as pd
data =[["A",34,23,67],["B",98,88,78],["C",76,89,87],["D",66,78,99]]
# dtype=int is dropped: the 'Name' column holds strings and cannot be cast to int
stu = pd.DataFrame(data,columns=['Name',"Maths",'Phy',"Chem"],index=[1,2,3,4])
stu


Unnamed: 0,Name,Maths,Phy,Chem
1,A,34,23,67
2,B,98,88,78
3,C,76,89,87
4,D,66,78,99
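The default labelling described in the note can be checked directly. A minimal sketch, using two of the rows from above (the names `default_df` and `named_df` are illustrative):

```python
import pandas as pd

rows = [['A', 34, 23, 67], ['B', 98, 88, 78]]

default_df = pd.DataFrame(rows)   # no labels given: both axes get 0..n-1
named_df = pd.DataFrame(rows, columns=['Name', 'Maths', 'Phy', 'Chem'], index=[1, 2])

print(default_df.index.tolist(), default_df.columns.tolist())
print(named_df.index.tolist(), named_df.columns.tolist())
```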


### Creating Data frame from Series


In [57]:
# Build a DataFrame from a dict of Series; rows are aligned on the union of the index labels
d = {'Chem' : pd.Series([30, 70, 35,25,26,67,77,67,89], index=["Raj",'Ram', 'Asa', 'Pi','Chi','Ru',"Sita","Ria","Gita"],dtype=int),
     'Math' : pd.Series([18, 26, 35, 40,55,89,79,100], index=["Raj",'Ram', 'Pi', 'Chi', 'Ru',"Sita","Ria","Gita"],dtype=int),
     'Phy' : pd.Series([31, 42, 83,34,80,78], index=["Asa",'Ram', 'Pi', 'Ru',"Sita","Gita"],dtype=int)}
exam = pd.DataFrame(d)


In [58]:
exam

Unnamed: 0,Chem,Math,Phy
Asa,35,,31.0
Chi,26,40.0,
Gita,89,100.0,78.0
Pi,25,35.0,83.0
Raj,30,18.0,
Ram,70,26.0,42.0
Ria,67,79.0,
Ru,67,55.0,34.0
Sita,77,89.0,80.0


In [59]:
exam1=exam.copy() ## To make copy of your data
exam1

Unnamed: 0,Chem,Math,Phy
Asa,35,,31.0
Chi,26,40.0,
Gita,89,100.0,78.0
Pi,25,35.0,83.0
Raj,30,18.0,
Ram,70,26.0,42.0
Ria,67,79.0,
Ru,67,55.0,34.0
Sita,77,89.0,80.0


### Data Preparation
- Removing or replacing null values
- Data description
- Adding new data fields for analysis
- Feature selection for machine learning
- Predictions and decisions

In [60]:
# Adding columns
exam["Total"]=exam["Math"]+exam["Phy"]+exam["Chem"]
exam

Unnamed: 0,Chem,Math,Phy,Total
Asa,35,,31.0,
Chi,26,40.0,,
Gita,89,100.0,78.0,267.0
Pi,25,35.0,83.0,143.0
Raj,30,18.0,,
Ram,70,26.0,42.0,138.0
Ria,67,79.0,,
Ru,67,55.0,34.0,156.0
Sita,77,89.0,80.0,246.0


In [61]:
# Adding columns

print ("Adding a new column using the existing columns in DataFrame:")
exam1['Total']=exam1['Chem']+exam1['Math']+exam1['Phy']
exam1 

Adding a new column using the existing columns in DataFrame:


Unnamed: 0,Chem,Math,Phy,Total
Asa,35,,31.0,
Chi,26,40.0,,
Gita,89,100.0,78.0,267.0
Pi,25,35.0,83.0,143.0
Raj,30,18.0,,
Ram,70,26.0,42.0,138.0
Ria,67,79.0,,
Ru,67,55.0,34.0,156.0
Sita,77,89.0,80.0,246.0


### Dealing with missing data
- Check for Missing Values

- To make detecting missing values easier (and across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects

In [62]:
exam

Unnamed: 0,Chem,Math,Phy,Total
Asa,35,,31.0,
Chi,26,40.0,,
Gita,89,100.0,78.0,267.0
Pi,25,35.0,83.0,143.0
Raj,30,18.0,,
Ram,70,26.0,42.0,138.0
Ria,67,79.0,,
Ru,67,55.0,34.0,156.0
Sita,77,89.0,80.0,246.0


In [63]:
exam1

Unnamed: 0,Chem,Math,Phy,Total
Asa,35,,31.0,
Chi,26,40.0,,
Gita,89,100.0,78.0,267.0
Pi,25,35.0,83.0,143.0
Raj,30,18.0,,
Ram,70,26.0,42.0,138.0
Ria,67,79.0,,
Ru,67,55.0,34.0,156.0
Sita,77,89.0,80.0,246.0


#### Check for null values

In [64]:
exam.isnull().sum()

Chem     0
Math     1
Phy      3
Total    4
dtype: int64

In [65]:
exam1.isnull().sum()

Chem     0
Math     1
Phy      3
Total    4
dtype: int64

### Calculations with Missing Data

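- When data is summed with sum(), NaN is skipped, which amounts to treating it as zero. If the data are all NaN, the result is 0 in pandas 0.22 and later (NaN in older versions, or when min_count=1 is passed). Element-wise addition with +, as used for the Total column above, propagates NaN.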
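How NaN behaves depends on whether values are aggregated or added element by element. The sketch below uses small hypothetical Series (not the exam data) and assumes pandas 0.22 or later, where `sum()` accepts `min_count`:

```python
import numpy as np
import pandas as pd

phy = pd.Series([31.0, np.nan, 78.0])
chem = pd.Series([35.0, 26.0, 89.0])

total_skipna = phy.sum()                    # NaN is skipped
elementwise = phy + chem                    # NaN propagates row by row
all_missing = pd.Series([np.nan, np.nan])
empty_sum = all_missing.sum()               # 0.0 with the default min_count=0
strict_sum = all_missing.sum(min_count=1)   # NaN: at least one valid value required

print(total_skipna, elementwise.tolist(), empty_sum, strict_sum)
```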

**Cleaning / Filling Missing Data**

- Pandas provides various methods for cleaning the missing values.
- The fillna() function can “fill in” NA values with non-null data in a couple of ways, illustrated in the following cells.

- Replace NaN with a scalar value: the following program shows how to replace "NaN" with "0".

In [66]:
exam=exam.fillna(0)
exam


Unnamed: 0,Chem,Math,Phy,Total
Asa,35,0.0,31.0,0.0
Chi,26,40.0,0.0,0.0
Gita,89,100.0,78.0,267.0
Pi,25,35.0,83.0,143.0
Raj,30,18.0,0.0,0.0
Ram,70,26.0,42.0,138.0
Ria,67,79.0,0.0,0.0
Ru,67,55.0,34.0,156.0
Sita,77,89.0,80.0,246.0


In [67]:
exam1=exam1.fillna(0)
exam1

Unnamed: 0,Chem,Math,Phy,Total
Asa,35,0.0,31.0,0.0
Chi,26,40.0,0.0,0.0
Gita,89,100.0,78.0,267.0
Pi,25,35.0,83.0,143.0
Raj,30,18.0,0.0,0.0
Ram,70,26.0,42.0,138.0
Ria,67,79.0,0.0,0.0
Ru,67,55.0,34.0,156.0
Sita,77,89.0,80.0,246.0


In [68]:
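# Note: exam1 was already filled with 0 above, so no NaN is left and the
# fillna calls below change nothing; only Total is recomputed.
# To impute with the median, call fillna(median) before fillna(0).
a=exam1["Math"].median()
exam1["Math"]= exam1["Math"].fillna(a)
b=exam1["Phy"].median()
exam1["Phy"]= exam1["Phy"].fillna(b)
exam1["Total"]=exam1["Math"]+exam1["Phy"]+exam1["Chem"]
exam1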

Unnamed: 0,Chem,Math,Phy,Total
Asa,35,0.0,31.0,66.0
Chi,26,40.0,0.0,66.0
Gita,89,100.0,78.0,267.0
Pi,25,35.0,83.0,143.0
Raj,30,18.0,0.0,48.0
Ram,70,26.0,42.0,138.0
Ria,67,79.0,0.0,146.0
Ru,67,55.0,34.0,156.0
Sita,77,89.0,80.0,246.0
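In the cell above the medians are computed after the zero fill, so the zeros stay in place. A minimal sketch of median imputation applied while the gaps are still NaN, on a small hypothetical marks table (not the exam data above):

```python
import numpy as np
import pandas as pd

# Small hypothetical marks table with gaps (not the exam data above).
marks = pd.DataFrame({'Math': [18.0, np.nan, 40.0, 100.0],
                      'Phy': [np.nan, 42.0, 83.0, 78.0]},
                     index=['Raj', 'Asa', 'Chi', 'Gita'])

# median() skips NaN; passing the resulting Series fills each column with its own median.
medians = marks.median()
filled = marks.fillna(medians)

print(medians.to_dict())
print(filled)
```

Passing a Series (or dict) keyed by column name lets `fillna` use a different fill value per column in one call.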


### Dropping missing values using "dropna()"


In [69]:
# exam no longer contains NaN (it was filled with 0 above), so dropna() removes nothing here
exam=exam.dropna()
exam

Unnamed: 0,Chem,Math,Phy,Total
Asa,35,0.0,31.0,0.0
Chi,26,40.0,0.0,0.0
Gita,89,100.0,78.0,267.0
Pi,25,35.0,83.0,143.0
Raj,30,18.0,0.0,0.0
Ram,70,26.0,42.0,138.0
Ria,67,79.0,0.0,0.0
Ru,67,55.0,34.0,156.0
Sita,77,89.0,80.0,246.0
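Because the exam frame no longer contains NaN at this point, the output above is unchanged. To see `dropna()` actually remove data, here is a minimal sketch on a small hypothetical frame that still has gaps:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'Chem': [35, 26, 89],
                       'Math': [np.nan, 40.0, 100.0],
                       'Phy': [31.0, np.nan, 78.0]},
                      index=['Asa', 'Chi', 'Gita'])

complete_rows = scores.dropna()               # drop rows with any NaN
math_known = scores.dropna(subset=['Math'])   # only require 'Math' to be present
complete_cols = scores.dropna(axis=1)         # drop columns with any NaN

print(complete_rows.index.tolist())
print(math_known.index.tolist())
print(complete_cols.columns.tolist())
```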


### Replacing generic values with the replace() function

In [70]:
df = pd.DataFrame({'one':[10,20,30,40,50,"ABC"], 'AGE':[-19,1,30,40,50,60]})

df

Unnamed: 0,one,AGE
0,10,-19
1,20,1
2,30,30
3,40,40
4,50,50
5,ABC,60


In [71]:
df = df.replace({"ABC":60,-19:19})
df 

Unnamed: 0,one,AGE
0,10,19
1,20,1
2,30,30
3,40,40
4,50,50
5,60,60


### Stats: Data Description


In [72]:
details = {'Brand':pd.Series(['Nokia','Asus',"Nokia","Nokia",'Samsung',"ABC",'Micromax','Apple','MI','Zen',"Apple"]),
   'Price':pd.Series([10000,8000,12500,7000,40000,12000,12999,13999,59999]),
   'Rating(10)':pd.Series([7,6.5,8.5,9,8,9.5,7,9])}
details

{'Brand': 0        Nokia
 1         Asus
 2        Nokia
 3        Nokia
 4      Samsung
 5          ABC
 6     Micromax
 7        Apple
 8           MI
 9          Zen
 10       Apple
 dtype: object,
 'Price': 0    10000
 1     8000
 2    12500
 3     7000
 4    40000
 5    12000
 6    12999
 7    13999
 8    59999
 dtype: int64,
 'Rating(10)': 0    7.0
 1    6.5
 2    8.5
 3    9.0
 4    8.0
 5    9.5
 6    7.0
 7    9.0
 dtype: float64}

In [73]:
d = pd.DataFrame(details)
d

Unnamed: 0,Brand,Price,Rating(10)
0,Nokia,10000.0,7.0
1,Asus,8000.0,6.5
2,Nokia,12500.0,8.5
3,Nokia,7000.0,9.0
4,Samsung,40000.0,8.0
5,ABC,12000.0,9.5
6,Micromax,12999.0,7.0
7,Apple,13999.0,9.0
8,MI,59999.0,
9,Zen,,


In [74]:
# numeric_only=True skips the string column 'Brand'
d.mean(numeric_only=True)


Price         19610.777778
Rating(10)        8.062500
dtype: float64

In [75]:
d['Price'].mean()

19610.777777777777

### The describe() function computes a summary of statistics pertaining to the DataFrame columns.

In [76]:
d.describe() 

Unnamed: 0,Price,Rating(10)
count,9.0,8.0
mean,19610.777778,8.0625
std,18086.018625,1.116036
min,7000.0,6.5
25%,10000.0,7.0
50%,12500.0,8.25
75%,13999.0,9.0
max,59999.0,9.5


In [77]:
d.describe(include="all")


Unnamed: 0,Brand,Price,Rating(10)
count,11,9.0,8.0
unique,8,,
top,Nokia,,
freq,3,,
mean,,19610.777778,8.0625
std,,18086.018625,1.116036
min,,7000.0,6.5
25%,,10000.0,7.0
50%,,12500.0,8.25
75%,,13999.0,9.0


### Renaming
- The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

In [78]:
d

Unnamed: 0,Brand,Price,Rating(10)
0,Nokia,10000.0,7.0
1,Asus,8000.0,6.5
2,Nokia,12500.0,8.5
3,Nokia,7000.0,9.0
4,Samsung,40000.0,8.0
5,ABC,12000.0,9.5
6,Micromax,12999.0,7.0
7,Apple,13999.0,9.0
8,MI,59999.0,
9,Zen,,


In [79]:
# Keys that match no existing label (such as 'Avg.Price') are silently ignored
d=d.rename(columns={'Brand' : 'Type', 'Avg.Price' : 'Price'},
index = {0 : 'S0', 1 : 'S1', 2 : 'S2'})
d

Unnamed: 0,Type,Price,Rating(10)
S0,Nokia,10000.0,7.0
S1,Asus,8000.0,6.5
S2,Nokia,12500.0,8.5
3,Nokia,7000.0,9.0
4,Samsung,40000.0,8.0
5,ABC,12000.0,9.5
6,Micromax,12999.0,7.0
7,Apple,13999.0,9.0
8,MI,59999.0,
9,Zen,,
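Besides a dict, `rename()` also accepts a function that is applied to every label. A small sketch on a hypothetical two-row frame (the names `phones` and `renamed` are illustrative):

```python
import pandas as pd

phones = pd.DataFrame({'Brand': ['Nokia', 'Asus'], 'Price': [10000, 8000]})

renamed = phones.rename(columns=str.lower,          # function applied to each column label
                        index=lambda i: f'S{i}')    # function applied to each row label

print(renamed.columns.tolist(), renamed.index.tolist())
```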


In [80]:
x=d["Price"].mean()
x

19610.777777777777

In [81]:
d.isnull().sum()

Type          0
Price         2
Rating(10)    3
dtype: int64

In [82]:
d["Price"]=d["Price"].fillna(x)
d

Unnamed: 0,Type,Price,Rating(10)
S0,Nokia,10000.0,7.0
S1,Asus,8000.0,6.5
S2,Nokia,12500.0,8.5
3,Nokia,7000.0,9.0
4,Samsung,40000.0,8.0
5,ABC,12000.0,9.5
6,Micromax,12999.0,7.0
7,Apple,13999.0,9.0
8,MI,59999.0,
9,Zen,19610.777778,


In [83]:
d["Rating(10)"]=d["Rating(10)"].fillna(d["Rating(10)"].mean())
d

Unnamed: 0,Type,Price,Rating(10)
S0,Nokia,10000.0,7.0
S1,Asus,8000.0,6.5
S2,Nokia,12500.0,8.5
3,Nokia,7000.0,9.0
4,Samsung,40000.0,8.0
5,ABC,12000.0,9.5
6,Micromax,12999.0,7.0
7,Apple,13999.0,9.0
8,MI,59999.0,8.0625
9,Zen,19610.777778,8.0625


In [84]:
d.isnull().sum()

Type          0
Price         0
Rating(10)    0
dtype: int64

## Sorting 

In [85]:
d

Unnamed: 0,Type,Price,Rating(10)
S0,Nokia,10000.0,7.0
S1,Asus,8000.0,6.5
S2,Nokia,12500.0,8.5
3,Nokia,7000.0,9.0
4,Samsung,40000.0,8.0
5,ABC,12000.0,9.5
6,Micromax,12999.0,7.0
7,Apple,13999.0,9.0
8,MI,59999.0,8.0625
9,Zen,19610.777778,8.0625


In [86]:
df2 = d.sort_values(by=['Rating(10)'],ascending= False)  # descending order
df2 

Unnamed: 0,Type,Price,Rating(10)
5,ABC,12000.0,9.5
3,Nokia,7000.0,9.0
7,Apple,13999.0,9.0
S2,Nokia,12500.0,8.5
8,MI,59999.0,8.0625
9,Zen,19610.777778,8.0625
10,Apple,19610.777778,8.0625
4,Samsung,40000.0,8.0
S0,Nokia,10000.0,7.0
6,Micromax,12999.0,7.0
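Several rows above share the same rating, so a second sort key can be used to break ties. A minimal sketch on a small hypothetical frame, sorting by rating (descending) and then by price (ascending):

```python
import pandas as pd

phones = pd.DataFrame({'Type': ['Nokia', 'Apple', 'Asus', 'Nokia'],
                       'Price': [7000, 13999, 8000, 10000],
                       'Rating(10)': [9.0, 9.0, 6.5, 7.0]})

# Highest rating first; ties broken by the lowest price.
ranked = phones.sort_values(by=['Rating(10)', 'Price'], ascending=[False, True])
restored = ranked.sort_index()   # back to the original row order

print(ranked)
```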


### get_dummies()

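- get_dummies() converts categorical columns into indicator (dummy) columns.
- The prefix argument may be a list with length equal to the number of columns being encoded.
- Returns the DataFrame with one-hot encoded values.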
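As a rough illustration of one-hot encoding, the sketch below encodes a `cut` column on a small hypothetical frame (three made-up rows, not the diamonds data) and passes a one-element prefix list for the one encoded column:

```python
import pandas as pd

stones = pd.DataFrame({'cut': ['Ideal', 'Good', 'Ideal'],
                       'price': [326, 327, 340]})

# Only the 'cut' column is encoded; 'price' passes through unchanged.
encoded = pd.get_dummies(stones, columns=['cut'], prefix=['cut'])

print(encoded.columns.tolist())
```

Each distinct category becomes its own column, named with the prefix followed by the category value.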

In [87]:
# to be continued...