![BY%20OYEBISI%20OYETAYO.png](attachment:BY%20OYEBISI%20OYETAYO.png)

In [1]:
import numpy as np
import pandas as pd 

# pandas provides two primary data structures for data analysis
* ## Series
* ## DataFrame

## The Series
The Series is the pandas object designed to represent one-dimensional data, 
similar to an array but with some additional features. Internally it is 
composed of two arrays associated with each other. The main array holds the data (of 
any NumPy type), and each element is associated with a label contained in the other array, called 
the Index.

In [2]:
s =pd.Series([12,-4,7,9], index=['a','b','c', 'd'])
s

a    12
b    -4
c     7
d     9
dtype: int64

To see the two arrays that make up this data structure individually, use the two 
attributes of the Series: index and values. 

In [3]:
s.values

array([12, -4,  7,  9], dtype=int64)

In [4]:
s.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [5]:
s.iloc[2]  # positional access; plain s[2] is deprecated for a label-indexed Series in recent pandas

7

In [6]:
s["b"]

-4

In [7]:
s[0:2]

a    12
b    -4
dtype: int64

In [8]:
s[['b','c']]

b   -4
c    7
dtype: int64

## Assigning Values to the Elements


In [9]:
s.iloc[1] = 0  # positional assignment
s

a    12
b     0
c     7
d     9
dtype: int64

In [10]:
s['b'] = 1
s

a    12
b     1
c     7
d     9
dtype: int64

## Evaluate Values 

In [11]:
serd = pd.Series([1,0,2,1,2,3],
                 index=['white','white','blue','green','green','yellow'])
serd

white     1
white     0
blue      2
green     1
green     2
yellow    3
dtype: int64

In [12]:
serd.unique()

array([1, 0, 2, 3], dtype=int64)

A function similar to unique() is value_counts(), which returns the unique values 
together with the number of times each one occurs in the Series.

In [13]:
serd.value_counts()

1    2
2    2
0    1
3    1
dtype: int64

Finally, isin() evaluates membership: given a list of values, it tells you, element by element, 
whether each value in the data structure is contained in that list. The Boolean values it returns are 
very useful for filtering data in a Series or in a column of a DataFrame.

In [14]:
serd.isin([0])

white     False
white      True
blue      False
green     False
green     False
yellow    False
dtype: bool

In [15]:
serd.isin([0,3])

white     False
white      True
blue      False
green     False
green     False
yellow     True
dtype: bool
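
The Boolean result of isin() can be used directly as a filter. The following is a minimal sketch that reuses the serd values from above; the names mask and selected are chosen here only for illustration.

```python
import pandas as pd

serd = pd.Series([1, 0, 2, 1, 2, 3],
                 index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

# Boolean mask: True where the value is 0 or 3
mask = serd.isin([0, 3])

# Keep only the elements for which the mask is True
selected = serd[mask]
print(selected)
```

Only the second white element and the yellow element survive the filter.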

# NaN Values 

The special value NaN (Not a Number) is used within pandas data structures to indicate 
an empty field or a value that cannot be defined numerically.
Generally, NaN values are a problem and must be managed in some way, especially during data 
analysis. They typically appear when extracting data from a source runs into trouble, 
or when the source itself has missing data. NaN values can also 
be generated in special cases, such as computing the logarithm of a negative value, or by exceptions during 
the execution of some calculation or function. In later chapters we will see how to apply different strategies to 
address the problem of NaN values.
Despite their problematic nature, pandas lets you explicitly define this value and add it to a 
data structure such as a Series. Within the array containing the values, enter np.nan wherever you want to 
define a missing value (the older alias np.NaN was removed in NumPy 2.0).

In [16]:
s2 = pd.Series([5,-3,np.nan,14])
s2

0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64
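
NaN can also come out of a computation rather than from missing input. The sketch below, with made-up values, takes the logarithm of a Series that contains a negative number; np.errstate is used only to silence the NumPy warning and does not change the result.

```python
import numpy as np
import pandas as pd

values = pd.Series([-1.0, 1.0, 10.0])

# The logarithm of a negative number is undefined, so NumPy returns NaN for it.
# errstate only suppresses the RuntimeWarning; the result is the same without it.
with np.errstate(invalid='ignore'):
    logs = np.log(values)

print(logs)
```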

### The isnull() and notnull() functions are very useful to identify the indexes without a value

In [17]:
s2.isnull()

0    False
1    False
2     True
3    False
dtype: bool

The isnull() function returns True for NaN values in 
the Series; conversely, the notnull() function returns True for values that are not NaN. Both are useful 
as conditions inside a filter.

## Series as Dictionaries
An alternative way to think of a Series is as a dict (dictionary) object. This similarity is also 
exploited when defining a Series: you can create a Series from a previously 
defined dict.

In [18]:
mydict = {'red': 2000, 'blue': 1000, 'yellow': 500, 'orange': 1000}
myseries = pd.Series(mydict)
myseries

red       2000
blue      1000
yellow     500
orange    1000
dtype: int64

In [19]:
colors = ['red','yellow','orange','blue','green']
myseries = pd.Series(mydict, index=colors)
myseries

red       2000.0
yellow     500.0
orange    1000.0
blue      1000.0
green        NaN
dtype: float64

## The DataFrame
The DataFrame is a tabular data structure very similar to a spreadsheet (the most familiar are Excel 
spreadsheets). This data structure is designed to extend the Series to multiple dimensions. In fact, 
the DataFrame consists of an ordered collection of columns, each of which can contain 
values of a different type (numeric, string, Boolean, etc.).

## Defining a DataFrame
The most common way to create a new DataFrame is precisely to pass a dict object to the DataFrame() 
constructor. This dict object contains a key for each column that we want to define, with an array of values 
for each of them

In [20]:
data = {'color' : ['blue','green','yellow','red','white'],
        'object' : ['ball','pen','pencil','paper','mug'],
        'price' : [1.2,1.0,0.6,0.9,1.7]}
frame =pd.DataFrame(data)
frame

Unnamed: 0,color,object,price
0,blue,ball,1.2
1,green,pen,1.0
2,yellow,pencil,0.6
3,red,paper,0.9
4,white,mug,1.7


If the dict from which you want to create a DataFrame contains more data than you are interested in, 
you can make a selection. In the DataFrame constructor, you can specify a sequence of columns using 
the columns option. The columns will be created in the order of the sequence, regardless of how they are 
ordered within the dict.

In [21]:
frame2 = pd.DataFrame(data, columns=['object','price'])
frame2

Unnamed: 0,object,price
0,ball,1.2
1,pen,1.0
2,pencil,0.6
3,paper,0.9
4,mug,1.7


In [22]:
frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),
                      index=['red','blue','yellow','white'],
                      columns=['ball','pen','pencil','paper'])
frame3

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [23]:
frame3.columns

Index(['ball', 'pen', 'pencil', 'paper'], dtype='object')

In [24]:
frame3.index

Index(['red', 'blue', 'yellow', 'white'], dtype='object')

In [25]:
frame3.values

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [26]:
frame.values

array([['blue', 'ball', 1.2],
       ['green', 'pen', 1.0],
       ['yellow', 'pencil', 0.6],
       ['red', 'paper', 0.9],
       ['white', 'mug', 1.7]], dtype=object)

In [27]:
frame['price']

0    1.2
1    1.0
2    0.6
3    0.9
4    1.7
Name: price, dtype: float64

In [28]:
frame[['price']]

Unnamed: 0,price
0,1.2
1,1.0
2,0.6
3,0.9
4,1.7


In [29]:
frame.price

0    1.2
1    1.0
2    0.6
3    0.9
4    1.7
Name: price, dtype: float64

In [30]:
frame


Unnamed: 0,color,object,price
0,blue,ball,1.2
1,green,pen,1.0
2,yellow,pencil,0.6
3,red,paper,0.9
4,white,mug,1.7


In [31]:
frame[2:3]

Unnamed: 0,color,object,price
2,yellow,pencil,0.6


In [32]:
frame[1:3]

Unnamed: 0,color,object,price
1,green,pen,1.0
2,yellow,pencil,0.6


In [33]:
frame['new']=2
frame

Unnamed: 0,color,object,price,new
0,blue,ball,1.2,2
1,green,pen,1.0,2
2,yellow,pencil,0.6,2
3,red,paper,0.9,2
4,white,mug,1.7,2


If, however, you want to update the contents of a column with a different value for each row, assign an array (or list) with one value per row.

In [34]:
frame['new'] = [3.0,1.3,2.2,0.8,1.1]
frame

Unnamed: 0,color,object,price,new
0,blue,ball,1.2,3.0
1,green,pen,1.0,1.3
2,yellow,pencil,0.6,2.2
3,red,paper,0.9,0.8
4,white,mug,1.7,1.1


In [35]:
frame.isin([1.0,'pen'])

Unnamed: 0,color,object,price,new
0,False,False,False,False
1,False,True,True,False
2,False,False,False,False
3,False,False,False,False
4,False,False,False,False


In [36]:
frame['color'].isin(['blue','red'])

0     True
1    False
2    False
3     True
4    False
Name: color, dtype: bool

## Deleting a Column
If you want to delete an entire column with all its contents, use the del command.

In [37]:
del frame['new']

In [38]:
frame

Unnamed: 0,color,object,price
0,blue,ball,1.2
1,green,pen,1.0
2,yellow,pencil,0.6
3,red,paper,0.9
4,white,mug,1.7


## Filtering 

In [39]:
frame3[frame3<12]

Unnamed: 0,ball,pen,pencil,paper
red,0.0,1.0,2.0,3.0
blue,4.0,5.0,6.0,7.0
yellow,8.0,9.0,10.0,11.0
white,,,,


## DataFrame from Nested dict

In [40]:
nestdict = { 'red': { 2012: 22, 2013: 33 },
 'white': { 2011: 13, 2012: 22, 2013: 16},
 'blue': {2011: 17, 2012: 27, 2013: 18}}
frame4 =pd.DataFrame(nestdict)
frame4

Unnamed: 0,red,white,blue
2012,22.0,22,27
2013,33.0,16,18
2011,,13,17


## Transposition of a DataFrame
An operation that might be needed when dealing with tabular data structures is transposition (that is, 
the columns become rows and the rows become columns). pandas lets you do this in a very simple way: you 
get the transpose of a DataFrame by accessing its T attribute.

In [41]:
frame4.T

Unnamed: 0,2012,2013,2011
red,22.0,33.0,
white,22.0,16.0,13.0
blue,27.0,18.0,17.0


In [42]:
frame4.idxmin()

red      2012
white    2011
blue     2011
dtype: int64

In [43]:
frame4.idxmax()

red      2013
white    2012
blue     2012
dtype: int64

In [44]:
frame4.columns

Index(['red', 'white', 'blue'], dtype='object')

In [45]:
frame4['red'].min()

22.0

In [46]:
frame4['red'].min  # without parentheses this returns the bound method itself, not the minimum

<bound method NDFrame._add_numeric_operations.<locals>.min of 2012    22.0
2013    33.0
2011     NaN
Name: red, dtype: float64>

## Statistics Functions
Most of the statistical functions for arrays are also valid for a DataFrame, so you do not need the 
apply() function to use them. For example, sum() and mean() calculate 
the sum and the average, respectively, of the elements in each column of a DataFrame.

In [47]:
frame3.sum()

ball      24
pen       28
pencil    32
paper     36
dtype: int64

In [48]:
frame3.mean()

ball      6.0
pen       7.0
pencil    8.0
paper     9.0
dtype: float64

In [49]:
frame3.columns

Index(['ball', 'pen', 'pencil', 'paper'], dtype='object')

In [50]:
frame3['pencil'].mean()

8.0

In [51]:
frame3['pencil'].median()

8.0

There is also a function called describe() that returns a set of summary statistics at once.

In [52]:
frame3.describe()

Unnamed: 0,ball,pen,pencil,paper
count,4.0,4.0,4.0,4.0
mean,6.0,7.0,8.0,9.0
std,5.163978,5.163978,5.163978,5.163978
min,0.0,1.0,2.0,3.0
25%,3.0,4.0,5.0,6.0
50%,6.0,7.0,8.0,9.0
75%,9.0,10.0,11.0,12.0
max,12.0,13.0,14.0,15.0


## Sorting and Ranking
Another fundamental operation that makes use of indexing is sorting. Sorting the data is often a 
necessity, and it is important to be able to do it easily. pandas provides the sort_index() function, which 
returns a new object identical to the original but with the elements ordered by index label.

In [53]:
ser = pd.Series([5,0,3,8,4], 
               index=['red','blue','yellow','white','green'])

In [54]:
ser

red       5
blue      0
yellow    3
white     8
green     4
dtype: int64

In [55]:
ser.sort_index()

blue      0
green     4
red       5
white     8
yellow    3
dtype: int64

By default the order is ascending, but you can reverse it by setting the ascending option to False.

In [56]:
ser.sort_index(ascending=False)

yellow    3
white     8
red       5
green     4
blue      0
dtype: int64

For a DataFrame, sorting can be performed independently on each of its two axes. To 
order the rows by their index labels, just use sort_index() without 
arguments as you have seen before; to order the columns, set the axis option 
to 1.

In [57]:
frame3.sort_index()

Unnamed: 0,ball,pen,pencil,paper
blue,4,5,6,7
red,0,1,2,3
white,12,13,14,15
yellow,8,9,10,11


In [58]:
frame3.sort_index(axis=1)

Unnamed: 0,ball,paper,pen,pencil
red,0,3,1,2
blue,4,7,5,6
yellow,8,11,9,10
white,12,15,13,14
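
sort_index() orders by label. To order by the stored values, or to obtain the position each value would take in that ordering, pandas also provides sort_values() and rank(). A short sketch using the same ser values as above:

```python
import pandas as pd

ser = pd.Series([5, 0, 3, 8, 4],
                index=['red', 'blue', 'yellow', 'white', 'green'])

# Sort by the stored values instead of by the index labels
by_value = ser.sort_values()

# rank() assigns 1 to the smallest value, 2 to the next one, and so on
ranks = ser.rank()

print(by_value)
print(ranks)
```

The ranks are returned as floats because ties, when present, are given the average of the ranks they span.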


## Correlation and Covariance
Two important statistical calculations are correlation and covariance, expressed in pandas by the corr() and 
cov() functions. These kinds of calculations normally involve two Series.

In [59]:
seq2 = pd.Series([3,4,3,4,5,4,3,2],['2006','2007','2008','2009','2010','2011','2012','2013'])
seq = pd.Series([1,2,3,4,4,3,2,1],['2006','2007','2008','2009','2010','2011','2012','2013'])
seq.corr(seq2)

0.7745966692414835

In [60]:
seq.cov(seq2)

0.8571428571428571
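
To make clear what these two numbers mean, the sketch below recomputes them by hand from the definitions of sample covariance and Pearson correlation and compares them with the pandas results. The names dev1, dev2, cov_manual and corr_manual are introduced here only for illustration.

```python
import pandas as pd

years = ['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013']
seq = pd.Series([1, 2, 3, 4, 4, 3, 2, 1], index=years)
seq2 = pd.Series([3, 4, 3, 4, 5, 4, 3, 2], index=years)

# Deviations of each series from its own mean
dev1 = seq - seq.mean()
dev2 = seq2 - seq2.mean()

# Sample covariance: sum of the products of the deviations divided by (n - 1)
cov_manual = (dev1 * dev2).sum() / (len(seq) - 1)

# Pearson correlation: covariance divided by the product of the sample standard deviations
corr_manual = cov_manual / (seq.std() * seq2.std())

print(cov_manual, seq.cov(seq2))
print(corr_manual, seq.corr(seq2))
```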

In [61]:
frame3

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [62]:
frame3.corr()

Unnamed: 0,ball,pen,pencil,paper
ball,1.0,1.0,1.0,1.0
pen,1.0,1.0,1.0,1.0
pencil,1.0,1.0,1.0,1.0
paper,1.0,1.0,1.0,1.0


In [63]:
frame3.cov()

Unnamed: 0,ball,pen,pencil,paper
ball,26.666667,26.666667,26.666667,26.666667
pen,26.666667,26.666667,26.666667,26.666667
pencil,26.666667,26.666667,26.666667,26.666667
paper,26.666667,26.666667,26.666667,26.666667


In [64]:
frame3['ball'].corr(frame3['pen'])

1.0

## Filtering Out NaN Values
There are various ways to eliminate NaN values during data analysis. Eliminating them 
by hand, element by element, can be very tedious and risky, because you can never be certain of having 
removed them all. The dropna() function helps here.


In [65]:
ser = pd.Series([0,1,2,np.nan,9], index=['red','blue','yellow','white','green'])
ser

red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64

In [66]:
ser.dropna()

red       0.0
blue      1.0
yellow    2.0
green     9.0
dtype: float64

Another possibility is to directly perform the filtering function by placing the notnull() in the selection 
condition.

In [67]:
ser[ser.notnull()]

red       0.0
blue      1.0
yellow    2.0
green     9.0
dtype: float64

With a DataFrame it gets a little more complex. If you use the dropna() function on this 
type of object, a single NaN value in a row is enough to eliminate that row completely (or that column, if you drop along the column axis).

In [68]:
frame5 = pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],
                      index = ['blue','green','red'],
                      columns = ['ball','mug','pen'])
frame5

Unnamed: 0,ball,mug,pen
blue,6.0,,6.0
green,,,
red,2.0,,5.0


In [69]:
frame5.dropna()

Unnamed: 0,ball,mug,pen


Therefore, to avoid having entire rows and columns disappear, specify the how 
option and set it to 'all', which tells the dropna() function to delete only the rows or 
columns in which all elements are NaN.

In [70]:
frame5.dropna(how='all') 


Unnamed: 0,ball,mug,pen
blue,6.0,,6.0
red,2.0,,5.0
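
The same how='all' rule can be applied to columns by also passing axis=1. A brief sketch on the frame5 data, where the intermediate names no_empty_cols and cleaned are assumptions made for this example:

```python
import numpy as np
import pandas as pd

frame5 = pd.DataFrame([[6, np.nan, 6], [np.nan, np.nan, np.nan], [2, np.nan, 5]],
                      index=['blue', 'green', 'red'],
                      columns=['ball', 'mug', 'pen'])

# Drop columns (axis=1) in which every element is NaN: only 'mug' qualifies
no_empty_cols = frame5.dropna(how='all', axis=1)

# Then drop rows in which every remaining element is NaN: only 'green' qualifies
cleaned = no_empty_cols.dropna(how='all')

print(cleaned)
```

After both steps no NaN is left, and no row or column holding real data has been discarded.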


## Filling in NaN Occurrences
Rather than filtering out NaN values, with the risk of discarding values that 
could be relevant to the analysis along with them, you can replace them with other numbers. For most 
purposes, the fillna() function is a good choice. Its main argument is the value with 
which to replace NaN. This can be a single value used for every NaN, or a dict that assigns a different 
replacement to each column; the two cells that follow show both forms.

In [71]:
frame5.fillna(0)

Unnamed: 0,ball,mug,pen
blue,6.0,0.0,6.0
green,0.0,0.0,0.0
red,2.0,0.0,5.0


In [72]:
frame5.fillna({'ball':1,'mug':0,'pen':99})

Unnamed: 0,ball,mug,pen
blue,6.0,0.0,6.0
green,1.0,0.0,99.0
red,2.0,0.0,5.0


## Hierarchical Indexing and Leveling
Hierarchical indexing is a very important feature of pandas, as it allows you to have multiple levels of 
indexes on a single axis. In effect, it gives you a way to work with data in multiple dimensions while continuing to 
work in a two-dimensional structure.


In [73]:
mser = pd.Series(np.random.rand(8),
                 index=[['white','white','white','blue','blue','red','red','red'],
                        ['up','down','right','up','down','up','down','left']])

In [74]:
mser

white  up       0.582157
       down     0.609543
       right    0.362102
blue   up       0.163451
       down     0.663332
red    up       0.520951
       down     0.226866
       left     0.013601
dtype: float64
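
With a two-level index you can select on either level, or move one level to the columns. The sketch below uses the same labels as mser but fixed values 0 to 7 instead of random numbers, so the results are reproducible; that substitution is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Same two-level index as mser, with fixed values 0..7 instead of random ones
mser = pd.Series(np.arange(8),
                 index=[['white', 'white', 'white', 'blue', 'blue', 'red', 'red', 'red'],
                        ['up', 'down', 'right', 'up', 'down', 'up', 'down', 'left']])

# Select on the outer level: returns a Series indexed by the inner level
white = mser['white']

# Select on the inner level with xs(): one value per outer label
up = mser.xs('up', level=1)

# unstack() moves the inner level to the columns, giving a DataFrame
table = mser.unstack()

print(white)
print(up)
print(table)
```

Combinations that do not exist in the Series, such as blue/left, appear as NaN in the unstacked table.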

As regards the DataFrame, it is possible to define a hierarchical index both for the rows and for the 
columns. At the time of the declaration of the DataFrame, you have to define an array of arrays for both the 
index option and the columns option

In [75]:
mframe = pd.DataFrame(np.random.randn(16).reshape(4,4),
                      index=[['white','white','red','red'], ['up','down','up','down']],
                      columns=[['pen','pen','paper','paper'],[1,2,1,2]])
mframe

                 pen               paper          
                   1         2         1         2
white up   -0.283535 -0.144218  1.502903 -1.513026
      down -0.759367  0.283623 -1.683984 -0.039951
red   up    0.097557 -1.839723 -0.180005  0.495250
      down -1.142560 -0.569429  0.600899  0.057390


## pandas: Reading and Writing Data

| Readers | Writers |
|---|---|
| read_csv | to_csv |
| read_excel | to_excel |
| read_hdf | to_hdf |
| read_sql | to_sql |
| read_json | to_json |
| read_html | to_html |
| read_stata | to_stata |
| read_clipboard | to_clipboard |

### Reading Data in CSV or Text Files
csvframe = pd.read_csv('myCSV_01.csv')

pd.read_csv('ch05_02.csv', header=None)

### Reading Data in Table
pd.read_table('ch05_01.csv', sep=',')

pd.read_table('ch05_04.txt', sep=r'\s+')

pd.read_table('ch05_05.txt', sep=r'\D+', header=None, engine='python')

### Reading data from html
web_frames = pd.read_html('myFrame.html')

### Reading data from excel
pd.read_excel('data.xls')
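
The files named above are not included here, so the following sketch reads from in-memory text through io.StringIO instead. The column names and values are invented for illustration; the read_csv options (header, sep) are the ones used in the calls above.

```python
import io
import pandas as pd

# Comma-separated text whose first line holds the column names
csv_text = "white,red,blue\n1,5,2\n2,7,8\n"
with_header = pd.read_csv(io.StringIO(csv_text))

# No header line: header=None makes pandas number the columns 0, 1, 2
no_header = pd.read_csv(io.StringIO("1,5,2\n2,7,8\n"), header=None)

# Columns separated by a variable amount of whitespace: use a regular expression
ws_text = "white red  blue\n1   5 2\n2 7    8\n"
ws_frame = pd.read_csv(io.StringIO(ws_text), sep=r"\s+")

# Writing with to_csv and reading back gives the same table
round_trip = pd.read_csv(io.StringIO(with_header.to_csv(index=False)))

print(with_header)
print(no_header)
```

Passing index=False to to_csv keeps the row labels out of the file, so the round trip does not gain an extra column.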

## Data Preparation
Before you start manipulating the data itself, you need to prepare it and assemble it into 
data structures that the tools of the pandas library can work with. 
The different procedures for data preparation are listed below:

* loading
* assembling: merging, concatenating, combining
* reshaping (pivoting)
* removing


Merging: the pandas.merge() function connects the rows in a DataFrame based on 
one or more keys. This mode is very familiar to those who know the SQL 
language, since it implements join operations.

Concatenating: the pandas.concat() function concatenates the objects along an axis.

Combining: the pandas.DataFrame.combine_first() function is a method that 
lets you connect overlapping data in order to fill in missing values in a data 
structure by taking data from another structure.
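
Merging is worked through in the next section. For the other two operations, here is a minimal sketch with two small DataFrame objects; the names first and second and their values are made up for this example.

```python
import numpy as np
import pandas as pd

first = pd.DataFrame({'x': [1.0, np.nan, 3.0]}, index=['a', 'b', 'c'])
second = pd.DataFrame({'x': [10.0, 20.0, 30.0, 40.0]}, index=['a', 'b', 'c', 'd'])

# concat stacks the objects along an axis (rows by default); labels may repeat
stacked = pd.concat([first, second])

# combine_first keeps the values of `first` and takes a value from `second`
# only where `first` is NaN or has no such label
filled = first.combine_first(second)

print(stacked)
print(filled)
```

In filled, row b comes from second because first holds NaN there, and row d comes from second because first has no such label.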

## Merging
Merging consists of combining data by connecting rows through one or more keys. 
On the basis of these 
keys it is possible to obtain new data in tabular form as the result of combining other tables. In 
pandas this operation is called merging, and merge() is the function that performs it.
First, define two DataFrame objects that will serve as examples 
for this section.


In [76]:
frame6 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'price': [12.33,11.44,33.21,13.23,33.62]})
frame6

Unnamed: 0,id,price
0,ball,12.33
1,pencil,11.44
2,pen,33.21
3,mug,13.23
4,ashtray,33.62


In [77]:
frame7 = pd.DataFrame( {'id':['pencil','pencil','ball','pen'],
                         'color': ['white','red','red','black']})
frame7

Unnamed: 0,id,color
0,pencil,white
1,pencil,red
2,ball,red
3,pen,black


In [78]:
pd.merge(frame6,frame7)

Unnamed: 0,id,price,color
0,ball,12.33,red
1,pencil,11.44,white
2,pencil,11.44,red
3,pen,33.21,black


The returned DataFrame consists of all rows that have an id in common 
between the two DataFrame objects. In addition to the common column, the columns from both the first and the 
second DataFrame are added.
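
Keeping only the common ids is the default behaviour, corresponding to an SQL inner join. The how option of merge() selects other join types. A short sketch with the same two frames, where the variable names inner and left are chosen for illustration:

```python
import pandas as pd

frame6 = pd.DataFrame({'id': ['ball', 'pencil', 'pen', 'mug', 'ashtray'],
                       'price': [12.33, 11.44, 33.21, 13.23, 33.62]})
frame7 = pd.DataFrame({'id': ['pencil', 'pencil', 'ball', 'pen'],
                       'color': ['white', 'red', 'red', 'black']})

# Default (how='inner'): only the ids present in both frames
inner = pd.merge(frame6, frame7, on='id')

# how='left': keep every row of frame6; ids missing from frame7 get NaN in color
left = pd.merge(frame6, frame7, on='id', how='left')

print(inner)
print(left)
```

The left join keeps mug and ashtray, which the inner join drops because they do not appear in frame7.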

In [79]:
frame8 = pd.DataFrame( {'id':['ball','pencil','pen','mug','ashtray'],
                        'color': ['white','red','red','black','green'],
                        'brand': ['OMG','ABC','ABC','POD','POD']})
frame9 = pd.DataFrame( {'id':['pencil','pencil','ball','pen'],
                        'brand': ['OMG','POD','ABC','POD']})

print(frame8)
print(frame9)

        id  color brand
0     ball  white   OMG
1   pencil    red   ABC
2      pen    red   ABC
3      mug  black   POD
4  ashtray  green   POD
       id brand
0  pencil   OMG
1  pencil   POD
2    ball   ABC
3     pen   POD


In [80]:
pd.merge(frame8,frame9)

Unnamed: 0,id,color,brand


Here both DataFrame objects have two columns with the same name, id and brand. By default merge() 
uses all shared columns as keys, and since no row matches on both id and brand, the merge returns no rows.

So it is necessary to explicitly define the criterion of merging that pandas must follow, by specifying the 
name of the key column in the on option.

In [81]:
pd.merge(frame8,frame9,on='id')

Unnamed: 0,id,color,brand_x,brand_y
0,ball,white,OMG,ABC
1,pencil,red,ABC,OMG
2,pencil,red,ABC,POD
3,pen,red,ABC,POD


In [82]:
pd.merge(frame8,frame9,on='brand')

Unnamed: 0,id_x,color,brand,id_y
0,ball,white,OMG,pencil
1,pencil,red,ABC,ball
2,pen,red,ABC,ball
3,mug,black,POD,pencil
4,mug,black,POD,pen
5,ashtray,green,POD,pencil
6,ashtray,green,POD,pen


The DataFrame objects also have a join() method, which is much more convenient when you want to 
merge by index. It can also be used to combine several DataFrame objects that have the same or 
similar indexes but non-overlapping columns.
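
A minimal sketch of join() follows. The two frames, colors and prices, are hypothetical and use the object names as index so that the rows can be aligned by label; their columns do not overlap.

```python
import pandas as pd

colors = pd.DataFrame({'color': ['white', 'red', 'red', 'black', 'green']},
                      index=['ball', 'pencil', 'pen', 'mug', 'ashtray'])
prices = pd.DataFrame({'price': [12.33, 11.44, 33.21]},
                      index=['ball', 'pencil', 'pen'])

# join() aligns the rows on the index labels; by default it keeps every row
# of the calling frame and fills unmatched cells with NaN
joined = colors.join(prices)
print(joined)
```

If the two frames shared a column name, join() would raise an error unless the lsuffix or rsuffix option were given.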

## DATA TRANSFORMATION 

The second stage of data manipulation is data transformation. Once the data have been arranged into the 
right form and placed within the data structure, the next step is to transform their values.

## Removing Duplicates
Duplicate rows might be present in a DataFrame for various reasons. In DataFrames of enormous size the 
detection of these rows can be very problematic. Also in this case, pandas provides us with a series of tools to 
analyze the duplicate data present in large data structures.


In [83]:
dframe = pd.DataFrame({ 'color': ['white','white','red','red','white'], 'value': [2,1,3,3,2]})
dframe

Unnamed: 0,color,value
0,white,2
1,white,1
2,red,3
3,red,3
4,white,2


The duplicated() function applied to a DataFrame can detect the rows which appear to be 
duplicated. It returns a Series of Booleans where each element corresponds to a row, with True if the row 
is duplicated (i.e., only the other occurrences, not the first), and with False if there are no duplicates in the 
previous elements.

In [84]:
dframe.duplicated()

0    False
1    False
2    False
3     True
4     True
dtype: bool

In [85]:
dframe[dframe.duplicated()]

Unnamed: 0,color,value
3,red,3
4,white,2


Generally, all duplicated rows are to be deleted from the DataFrame; to do that, pandas provides the 
drop_duplicates() function, which returns the DataFrame without duplicate rows.

In [86]:
dframe.drop_duplicates() 

Unnamed: 0,color,value
0,white,2
1,white,1
2,red,3


In [87]:
dframe2=dframe.drop_duplicates()

In [88]:
dframe2

Unnamed: 0,color,value
0,white,2
1,white,1
2,red,3
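
Both functions accept a keep option that controls which occurrence counts as the original, and drop_duplicates() also accepts subset to compare only some columns. A brief sketch on the same dframe, with result names chosen for illustration:

```python
import pandas as pd

dframe = pd.DataFrame({'color': ['white', 'white', 'red', 'red', 'white'],
                       'value': [2, 1, 3, 3, 2]})

# keep=False marks every member of a duplicated group, including the first one
all_copies = dframe.duplicated(keep=False)

# keep='last' retains the last occurrence of each row instead of the first
last_kept = dframe.drop_duplicates(keep='last')

# subset restricts the comparison to the listed columns
by_color = dframe.drop_duplicates(subset='color')

print(all_copies)
print(last_kept)
print(by_color)
```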


## Mapping
The pandas library provides a set of functions that exploit mapping to 
perform some operations. A mapping is nothing more than a list of matches between two 
different values, which lets you bind a value to a particular label or string.

## Replacing Values via Mapping
Often in the data structure that you have assembled there are values that do not meet your needs. For 
example, the text may be in a foreign language, may be a synonym of another value, or may not be 
expressed in the desired form. In such cases, replacing various values is often 
necessary.
Define, as an example, a DataFrame containing various objects and colors, including two colors that are 
not in English. Assembly operations often leave you with data whose values are in an 
undesired form.

In [89]:
frame10 = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
                     'color':['white','rosso','verde','black','yellow'],
                     'price':[5.56,4.20,1.30,0.56,2.75]})

In [90]:
frame10

Unnamed: 0,item,color,price
0,ball,white,5.56
1,mug,rosso,4.2
2,pen,verde,1.3
3,pencil,black,0.56
4,ashtray,yellow,2.75


In [91]:
newcolors = { 'rosso': 'red','verde': 'green'}
frame10.replace(newcolors)

Unnamed: 0,item,color,price
0,ball,white,5.56
1,mug,red,4.2
2,pen,green,1.3
3,pencil,black,0.56
4,ashtray,yellow,2.75


As you can see from the result, the two colors have been replaced with the correct values within the 
DataFrame. A common case is the replacement of NaN values with another value, for 
example 0. Here too you can use replace(), which does the job well.

In [92]:
ser2 = pd.Series([1,3,np.nan,4,6,np.nan,3])
ser2

0    1.0
1    3.0
2    NaN
3    4.0
4    6.0
5    NaN
6    3.0
dtype: float64

In [93]:
ser2.replace(np.nan,0)

0    1.0
1    3.0
2    0.0
3    4.0
4    6.0
5    0.0
6    3.0
dtype: float64

In [94]:
frame11 = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
                      'color':['white','red','green','black','yellow']})

In [95]:
frame11

Unnamed: 0,item,color
0,ball,white
1,mug,red
2,pen,green
3,pencil,black
4,ashtray,yellow


In [96]:
price = { 'ball' : 5.5, 'mug' : 4.20,'bottle' : 1.30,'scissors' : 3.41, 'pen' : 1.30, 'pencil' : 0.56, 'ashtray' : 2.75}

The map() function applied to a Series or to a column of a DataFrame accepts a function or a 
dict containing the mapping. So in this case you can apply the mapping of the prices to the item column, 
adding a price column to the DataFrame.
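
The step just described can be sketched as follows, using the frame11 and price objects defined above. Keys of the dict that do not occur in the item column, such as bottle and scissors, are simply not used.

```python
import pandas as pd

frame11 = pd.DataFrame({'item': ['ball', 'mug', 'pen', 'pencil', 'ashtray'],
                        'color': ['white', 'red', 'green', 'black', 'yellow']})
price = {'ball': 5.5, 'mug': 4.20, 'bottle': 1.30, 'scissors': 3.41,
         'pen': 1.30, 'pencil': 0.56, 'ashtray': 2.75}

# Look up each item in the dict and store the result in a new column
frame11['price'] = frame11['item'].map(price)
print(frame11)
```

An item with no entry in the dict would receive NaN in the price column.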

## Rename the Indexes of the Axes

In [None]:
frame11

In [None]:
reindex = {
    0: 'first',
    1: 'second',
    2: 'third',
    3: 'fourth',
    4: 'fifth'}

In [None]:
frame11.rename(reindex)

By default, the indexes are renamed. If you want to rename columns you must use 
the columns option. This time you assign the mappings explicitly to the index and 
columns options.

In [None]:
recolumn = {
    'item': 'object',
    'price': 'value'}

In [None]:
frame11.rename(index=reindex, columns=recolumn)

For the simplest cases, in which you have a single value to replace, you can pass the mappings directly 
as arguments to the function and avoid writing and assigning many variables.

In [None]:
frame11.rename(index={1:'first'}, columns={'item':'object'})

## Discretization and Binning

To analyze continuous data, such as a series of experimental readings, it is often necessary to transform it into discrete 
categories, for example by dividing the range of values into smaller intervals and counting 
the occurrences or computing statistics within each of them. Another case is a huge number of samples 
due to precise readings on a population. Here too, to facilitate analysis it is necessary to divide 
the range of values into categories and then analyze the occurrences and statistics related to each of them.
For example, you may have readings of an experimental value between 0 and 100. These 
data are collected in a list.

In [None]:
results = [12,34,67,55,28,90,99,12,3,56,74,44,87,23,49,89,87]

The experimental values range from 0 to 100, so you can divide 
this interval uniformly into, for example, four equal parts, i.e., bins. The first contains the values between 0 and 25, the 
second between 26 and 50, the third between 51 and 75, and the last between 76 and 100.
To do this binning with pandas, first define an array containing the values that separate the bins:

In [None]:
bins = [0,25,50,75,100]

Then apply the special function cut() to the array of results, also passing the bins.

In [None]:
cat = pd.cut(results, bins)
cat

In [None]:
cat.categories

In [None]:
cat.value_counts()

In [None]:
bin_names = ['unlikely','less likely','likely','highly likely']
pd.cut(results, bins, labels=bin_names)
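
Since the cells above have no recorded output, here is a self-contained sketch that counts how many readings fall into each labelled bin. Note that cut() uses right-closed intervals by default, so a reading of exactly 25 would fall in the first bin. The names labelled and counts are introduced for this example.

```python
import pandas as pd

results = [12, 34, 67, 55, 28, 90, 99, 12, 3, 56, 74, 44, 87, 23, 49, 89, 87]
bins = [0, 25, 50, 75, 100]
bin_names = ['unlikely', 'less likely', 'likely', 'highly likely']

# Assign each reading to one of the four labelled intervals
labelled = pd.cut(results, bins, labels=bin_names)

# Count the readings in each bin
counts = pd.Series(labelled).value_counts()
print(counts)
```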

## Detecting and Filtering Outliers

During the data analysis, the need to detect the presence of abnormal values within a data structure often 
arises. By way of example, create a DataFrame with three columns from 1,000 completely random values

In [None]:
randframe = pd.DataFrame(np.random.randn(1000,3))
randframe

In [None]:
randframe.describe()

In [None]:
randframe.std()

You might consider as outliers those values whose absolute value is greater than three times the standard 
deviation. To get just the standard deviation of each column of the DataFrame, use the std() function, as in the cell above. 
The filter below keeps the rows that contain at least one such value.

In [None]:
randframe[(np.abs(randframe) > (3*randframe.std())).any(axis=1)]

## Data Aggregation
The last stage of data manipulation is data aggregation. Data aggregation generally means a 
transformation that produces a single value from an array. In fact, you have already performed many 
aggregation operations, for example when calculating sum(), mean(), or count(). These functions 
operate on a set of data and produce a result consisting of a single value. 
However, a more formal approach, and the one that gives more control over data aggregation, is to 
divide the set into categories first.

## GroupBy
Now you will analyze in detail what the GroupBy process is and how it works. Its 
internal mechanism is generally referred to as SPLIT-APPLY-COMBINE. You can 
think of the process as divided into three phases, each expressed by one operation:

* splitting: division of the dataset into groups
* applying: application of a function on each group
* combining: combination of all the results obtained from the different groups
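
The three phases can be sketched by hand and compared with what groupby() returns. This is only an illustration of the idea, not how pandas implements it internally; the frame and the names pieces, means, manual and builtin are assumptions made for this example.

```python
import pandas as pd

frame = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                      'price1': [5.56, 4.20, 1.30, 0.56, 2.75]})

# Split: one piece of the price1 column per distinct color
pieces = {color: frame.loc[frame['color'] == color, 'price1']
          for color in frame['color'].unique()}

# Apply: compute the mean of each piece
means = {color: piece.mean() for color, piece in pieces.items()}

# Combine: collect the per-group results into a single Series
manual = pd.Series(means, name='price1').sort_index()

# The same result in one step with groupby()
builtin = frame.groupby('color')['price1'].mean()

print(manual)
print(builtin)
```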

In [100]:
Nframe = pd.DataFrame({ 'color': ['white','red','green','red','green'],
                       'object': ['pen','pencil','pencil','ashtray','pen'],
                       'price1' : [5.56,4.20,1.30,0.56,2.75],
                       'price2' : [4.75,4.12,1.60,0.75,3.15]})

In [101]:
Nframe

Unnamed: 0,color,object,price1,price2
0,white,pen,5.56,4.75
1,red,pencil,4.2,4.12
2,green,pencil,1.3,1.6
3,red,ashtray,0.56,0.75
4,green,pen,2.75,3.15


Suppose you want to calculate the average of the price1 column using the group labels listed in the color column. 
There are several ways to do this. You can, for example, access the price1 column and call the groupby() 
function with the color column.

In [104]:
group = Nframe['price1'].groupby(Nframe['color'])
group

<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000229795E09D0>

The object you get is a GroupBy object. The operation you just performed did not actually calculate 
anything; it only collected the information needed for the calculation to be executed later. What was 
done is a grouping, in which all rows having the same value of color are gathered into 
a single item.

In [106]:
group.mean()

color
green    2.025
red      2.380
white    5.560
Name: price1, dtype: float64

In [107]:
group.sum()

color
green    4.05
red      4.76
white    5.56
Name: price1, dtype: float64

## Hierarchical Grouping
You have seen how to group the data according to the values of one column used as a key. The same 
can be extended to multiple columns, i.e., grouping by multiple keys hierarchically.

In [109]:
ggroup = Nframe['price1'].groupby([Nframe['color'],Nframe['object']])

In [111]:
ggroup.sum()

color  object 
green  pen        2.75
       pencil     1.30
red    ashtray    0.56
       pencil     4.20
white  pen        5.56
Name: price1, dtype: float64

In [112]:
ggroup.mean()

color  object 
green  pen        2.75
       pencil     1.30
red    ashtray    0.56
       pencil     4.20
white  pen        5.56
Name: price1, dtype: float64

In [113]:
means = Nframe.groupby('color').mean(numeric_only=True).add_prefix('mean_')
means

       mean_price1  mean_price2
color                          
green        2.025        2.375
red          2.380        2.435
white        5.560        4.750


In [114]:
group = Nframe.groupby('color')
group['price1'].quantile(0.6)

color
green    2.170
red      2.744
white    5.560
Name: price1, dtype: float64
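
The quantile values above may look surprising for groups of only two rows. The sketch below reproduces them by hand, assuming the default linear interpolation that pandas uses for quantile(); `linear_quantile` is a hypothetical helper written for this illustration, not a pandas function.

```python
import numpy as np
import pandas as pd

Nframe = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                       'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
                       'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
                       'price2': [4.75, 4.12, 1.60, 0.75, 3.15]})

def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = np.sort(np.asarray(values, dtype=float))
    pos = q * (len(ordered) - 1)            # fractional position in the sorted data
    lower = int(np.floor(pos))
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

# Hand-computed value for every color group.
manual = {color: linear_quantile(part, 0.6)
          for color, part in Nframe.groupby('color')['price1']}

# The same quantity computed by pandas.
pandas_result = Nframe.groupby('color')['price1'].quantile(0.6)

print(manual)
print(pandas_result)
```

For green, the sorted prices are 1.30 and 2.75, so the 0.6 quantile is 1.30 + 0.6 × (2.75 − 1.30) = 2.17, which matches the output above.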

In [116]:
# Named value_range so that it does not shadow Python's built-in range().
def value_range(series):
    return series.max() - series.min()

group['price1'].agg(value_range)

color
green    1.45
red      3.64
white    0.00
Name: price1, dtype: float64

In [118]:
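# Restrict to the numeric columns: max() - min() is not defined for the text in 'object'.
def value_range(series):
    return series.max() - series.min()

group[['price1', 'price2']].agg(value_range)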


color,price1,price2
green,1.45,1.55
red,3.64,3.37
white,0.0,0.0
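
agg() can also apply several functions in one call. The following sketch mixes built-in aggregation names with the custom function, and then uses named aggregation to choose the output column names. Nframe is rebuilt from the values shown above; `value_range`, `price1_range`, and `price2_mean` are names chosen for this illustration.

```python
import pandas as pd

Nframe = pd.DataFrame({'color': ['white', 'red', 'green', 'red', 'green'],
                       'object': ['pen', 'pencil', 'pencil', 'ashtray', 'pen'],
                       'price1': [5.56, 4.20, 1.30, 0.56, 2.75],
                       'price2': [4.75, 4.12, 1.60, 0.75, 3.15]})

def value_range(series):
    """Difference between the largest and smallest value of a group."""
    return series.max() - series.min()

group = Nframe.groupby('color')

# A list of functions gives one output column per function;
# a custom function contributes its own name as the column label.
summary = group['price1'].agg(['min', 'max', value_range])

# Named aggregation: output_name=(input_column, function).
named = group.agg(price1_range=('price1', value_range),
                  price2_mean=('price2', 'mean'))

print(summary)
print(named)
```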


In [120]:
sums = Nframe.groupby('color').sum(numeric_only=True).add_prefix('tot_')
sums

color,tot_price1,tot_price2
green,4.05,4.75
red,4.76,4.87
white,5.56,4.75


In [130]:
Nframe.groupby('color').transform('sum').add_prefix('tot_')

Unnamed: 0,tot_object,tot_price1,tot_price2
0,pen,5.56,4.75
1,pencilashtray,4.76,4.87
2,pencilpen,4.05,4.75
3,pencilashtray,4.76,4.87
4,pencilpen,4.05,4.75


In [136]:
Nframe1 = pd.DataFrame( { 'color':['white','black','white','white','black','black'],
                    'status':['up','up','down','down','down','up'],
                    'value1':[12.33,14.55,22.34,27.84,23.40,18.33],
                    'value2':[11.23,31.80,29.99,31.18,18.25,22.44]})
Nframe1

Unnamed: 0,color,status,value1,value2
0,white,up,12.33,11.23
1,black,up,14.55,31.8
2,white,down,22.34,29.99
3,white,down,27.84,31.18
4,black,down,23.4,18.25
5,black,up,18.33,22.44


In [137]:
Nframe1.groupby(['color', 'status']).apply(lambda x: x.max())


color (index),status (index),color,status,value1,value2
black,down,black,down,23.4,18.25
black,up,black,up,18.33,31.8
white,down,white,down,27.84,31.18
white,up,white,up,12.33,11.23
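
If the goal is only the per-group maximum of the numeric columns, the same numbers can be obtained without apply(). This is a sketch under that assumption; Nframe1 is rebuilt from the values shown above, and `maxima` and `via_apply` are illustrative names.

```python
import numpy as np
import pandas as pd

Nframe1 = pd.DataFrame({'color': ['white', 'black', 'white', 'white', 'black', 'black'],
                        'status': ['up', 'up', 'down', 'down', 'down', 'up'],
                        'value1': [12.33, 14.55, 22.34, 27.84, 23.40, 18.33],
                        'value2': [11.23, 31.80, 29.99, 31.18, 18.25, 22.44]})

keys = ['color', 'status']

# Built-in aggregation on the numeric columns only.
maxima = Nframe1.groupby(keys)[['value1', 'value2']].max()

# The apply() form, restricted to the same columns, gives the same values.
via_apply = Nframe1.groupby(keys)[['value1', 'value2']].apply(lambda x: x.max())

print(maxima)
```

apply() remains useful when the function needs to see the whole group at once, for example to combine several columns.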


In [147]:
reindex = {0:'first',1:'second',2:'third',3:'fourth'}
recolumn = {'price1':'value1', 'price2':'value2'}
Nframe.rename(index=reindex, columns=recolumn)

Unnamed: 0,color,object,value1,value2
first,white,pen,5.56,4.75
second,red,pencil,4.2,4.12
third,green,pencil,1.3,1.6
fourth,red,ashtray,0.56,0.75
4,green,pen,2.75,3.15


In [155]:
# Ten hourly timestamps starting at midnight on 1 January 2015
# (pandas 2.2 and later spell the hourly alias 'h').
temp = pd.date_range('1/1/2015', periods=10, freq='H')
timeseries = pd.Series(np.random.rand(10), index=temp)
timeseries

2015-01-01 00:00:00    0.429710
2015-01-01 01:00:00    0.803686
2015-01-01 02:00:00    0.017311
2015-01-01 03:00:00    0.570183
2015-01-01 04:00:00    0.094417
2015-01-01 05:00:00    0.256164
2015-01-01 06:00:00    0.388451
2015-01-01 07:00:00    0.951196
2015-01-01 08:00:00    0.206798
2015-01-01 09:00:00    0.100417
Freq: H, dtype: float64

In [159]:
timetable = pd.DataFrame({'date': temp,
                          'value1': np.random.rand(10),
                          'value2': np.random.rand(10)})
timetable

Unnamed: 0,date,value1,value2
0,2015-01-01 00:00:00,0.4023,0.680672
1,2015-01-01 01:00:00,0.792121,0.962476
2,2015-01-01 02:00:00,0.340117,0.092775
3,2015-01-01 03:00:00,0.530493,0.806974
4,2015-01-01 04:00:00,0.372865,0.744873
5,2015-01-01 05:00:00,0.552031,0.637527
6,2015-01-01 06:00:00,0.409613,0.476015
7,2015-01-01 07:00:00,0.80756,0.43155
8,2015-01-01 08:00:00,0.318821,0.295573
9,2015-01-01 09:00:00,0.582791,0.455838


Next, we add to this DataFrame a column of text values that will serve as grouping keys.


In [160]:
timetable['cat'] = ['up','down','left','left','up','up','down','right','right','up']
timetable

Unnamed: 0,date,value1,value2,cat
0,2015-01-01 00:00:00,0.4023,0.680672,up
1,2015-01-01 01:00:00,0.792121,0.962476,down
2,2015-01-01 02:00:00,0.340117,0.092775,left
3,2015-01-01 03:00:00,0.530493,0.806974,left
4,2015-01-01 04:00:00,0.372865,0.744873,up
5,2015-01-01 05:00:00,0.552031,0.637527,up
6,2015-01-01 06:00:00,0.409613,0.476015,down
7,2015-01-01 07:00:00,0.80756,0.43155,right
8,2015-01-01 08:00:00,0.318821,0.295573,right
9,2015-01-01 09:00:00,0.582791,0.455838,up
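
With the cat column in place, it can be used as a grouping key like any other column. The sketch below builds a comparable table with a seeded random generator so that it can be rerun; the seed, the use of `pd.Timedelta` for the hourly step, and the names `by_cat` and `counts` are assumptions made for this illustration, so the numbers will differ from the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatable values

# Ten hourly timestamps; a Timedelta step avoids version-specific frequency aliases.
dates = pd.date_range('2015-01-01', periods=10, freq=pd.Timedelta(hours=1))

timetable = pd.DataFrame({'date': dates,
                          'value1': rng.random(10),
                          'value2': rng.random(10)})
timetable['cat'] = ['up', 'down', 'left', 'left', 'up',
                    'up', 'down', 'right', 'right', 'up']

# Average of the two value columns for each category.
by_cat = timetable.groupby('cat')[['value1', 'value2']].mean()

# Number of rows that fall into each category.
counts = timetable.groupby('cat').size()

print(by_cat)
print(counts)
```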


## Conclusions
Data manipulation can be divided into three basic phases: 
* preparation,
* processing, 
* data aggregation. 

Through a series of examples, you have seen the library functions that pandas provides for each of these operations.


# THE NEXT TO BE DROPPED IS DATA VISUALIZATION 