# pandas (Python)

pandas is a Python library for data analysis. It provides data structures and functions designed to make cleaning, exploring, analyzing, and manipulating data fast and easy.
pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib.
pandas adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops.

While pandas adopts many coding idioms from NumPy, the biggest difference is that
pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast,
is best suited for working with homogeneous numerical array data.
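The contrast above can be sketched in a few lines. This is a minimal illustration, not taken from the original notebook; the names `prices`, `discounted`, `looped`, `arr`, and `table` and their values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen only for illustration.
prices = pd.Series([10.0, 20.0, 30.0])

# Array-style computing: one expression applies to every element, no for loop.
discounted = prices * 0.9

# The equivalent explicit loop, shown only for comparison.
looped = pd.Series([p * 0.9 for p in prices])

# A NumPy array holds a single dtype (homogeneous data) ...
arr = np.array([1.5, 2.5, 3.5])

# ... while a DataFrame can hold a different dtype in each column (tabular data).
table = pd.DataFrame({"name": ["a", "b", "c"], "value": arr})
print(table.dtypes)
```

Both `discounted` and `looped` hold the same numbers; the vectorized form is simply shorter and usually faster.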

The name "pandas" refers to both "panel data" and "Python data analysis". The library was created by Wes McKinney in 2008.

## Why Use Pandas?
pandas lets us analyze large data sets and draw conclusions from them using statistical methods.

pandas can clean messy data sets and make them readable and relevant.
Relevant data is very important in data science.

## What Can Pandas Do?
pandas can answer questions about the data, such as:
Is there a correlation between two or more columns?
What is the average value?
What is the maximum value?
What is the minimum value?
pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
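The questions listed above map onto a handful of pandas methods. The sketch below uses a small, made-up table (the column names `height` and `weight` and their values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; the last row has a missing height.
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, np.nan],
    "weight": [50.0, 60.0, 70.0, 80.0],
})

print(df["weight"].mean())  # average value
print(df["weight"].max())   # maximum value
print(df["weight"].min())   # minimum value
print(df.corr())            # pairwise correlation between the numeric columns

# Cleaning: drop every row that contains a missing (NaN) value.
cleaned = df.dropna()
print(cleaned)
```

`dropna` returns a new DataFrame and leaves `df` unchanged, so the original data is still available if the cleaning step needs to be revisited.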

In [1]:
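# Import pandas under its conventional alias
import pandas as pd
# pandas is now imported and ready to use.
# Series and DataFrame can also be imported directly into the local namespace.
from pandas import Series, DataFrame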

# Introduction to pandas Data Structures

Series: A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index.
A pandas Series is like a column in a table: a one-dimensional labeled array that can hold data of any type.

In [2]:
#Create a simple Pandas Series from a list:
obj = pd.Series([4, 7, -5, 3, 7, 10])
print(obj)

0     4
1     7
2    -5
3     3
4     7
5    10
dtype: int64


In [3]:
# Or build the Series from an existing list (note that obj now has four values):
a = [4, 7, -5, 3]
obj = pd.Series(a)
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [4]:
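print(obj.values)
print(obj.index)  # like range(4)
# If nothing else is specified, the values are labeled with their position:
# the first value has index 0, the second has index 1, and so on.
print(obj[0])   # first value of the Series
print(obj[3])   # value with label 3
print(obj[-1])  # raises KeyError: with an integer index, [] looks up labels, and -1 is not a label
print(obj[5])   # never reached; 5 is not a label either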


[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)
4
3


KeyError: -1
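The `KeyError` above comes from label lookup. A minimal sketch of the explicit alternatives, assuming the same four-value Series: `.iloc` indexes by position and `.loc` indexes by label.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# .iloc always indexes by integer position, so negative positions work.
last = obj.iloc[-1]
print(last)

# .loc always indexes by label; here the labels are 0, 1, 2, 3.
print(obj.loc[3])

# Label lookup of -1 fails because -1 is not one of the labels.
try:
    obj.loc[-1]
except KeyError as exc:
    print("KeyError:", exc)
```

Using `.loc` and `.iloc` explicitly avoids the ambiguity of plain `[]` on a Series whose labels are integers.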

In [5]:
#Create Labels
# With the index argument, you can create your own labels.
obj2 = pd.Series([4, 7, -5, 3, 10], index=['a', 'b', 'c', 'd','e'])
print(obj2)

a     4
b     7
c    -5
d     3
e    10
dtype: int64


In [6]:
print(obj2)
print(obj2['b'])  # select a single value by label
obj2['e'] = 6     # assign a new value to label 'e'
print(obj2)
print(obj2[['e', 'b', 'a']])  # select several values with a list of labels

a     4
b     7
c    -5
d     3
e    10
dtype: int64
7
a    4
b    7
c   -5
d    3
e    6
dtype: int64
e    6
b    7
a    4
dtype: int64


In [7]:
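# Some operations
print(obj2)
# obj2[obj2 < 5] would keep only the values of obj2 that are below 5
print(obj < 5)  # note: this compares obj (the four-value Series from In [3]), not obj2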

a    4
b    7
c   -5
d    3
e    6
dtype: int64
0     True
1    False
2     True
3     True
dtype: bool
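The boolean Series produced by a comparison can be used as a filter. A short sketch, assuming the `obj2` values shown above (after `obj2['e']` was set to 6); the names `mask` and `small` are illustrative.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3, 6], index=["a", "b", "c", "d", "e"])

# The comparison yields a boolean Series with the same index as obj2.
mask = obj2 < 5

# Indexing with the mask keeps only the entries where the mask is True.
small = obj2[mask]
print(small)
```

The index labels travel with the values, so the result still shows which entries passed the filter.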


In [8]:
obj2 * 2

a     8
b    14
c   -10
d     6
e    12
dtype: int64

In [9]:
import numpy as np
np.exp(obj2)

a      54.598150
b    1096.633158
c       0.006738
d      20.085537
e     403.428793
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping
of index values to data values. 
We can also use a key/value object, like a dictionary, when creating a Series.

In [10]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
print(obj3)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64


In [11]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)  # labels missing from sdata get NaN (missing or NA values); 'Utah' is left out
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [12]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
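A brief sketch of how the dict analogy and missing-value handling fit together, reusing `sdata` and `states` from the cells above. The fill value of 0 is an arbitrary choice for illustration.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)

# Dict-like membership tests work against the index labels.
print("Ohio" in obj4)
print("Utah" in obj4)

# pd.notnull is the inverse of pd.isnull.
print(pd.notnull(obj4))

# Missing entries can be dropped or replaced with a chosen value.
dropped = obj4.dropna()
filled = obj4.fillna(0)
print(dropped)
print(filled)
```

Both `dropna` and `fillna` return new Series; `obj4` itself still contains the NaN for California.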

## DataFrame

Data sets in pandas are usually multi-dimensional tables, called DataFrames. A Series is like a column; a DataFrame is the whole table.

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and a column index; it can be thought of as a dict of Series all sharing the same index.
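The "dict of Series" view can be made concrete with a small sketch. The Series names `population` and `year` and their values are hypothetical.

```python
import pandas as pd

# Two Series that share the same row labels.
population = pd.Series([1.5, 2.4], index=["Ohio", "Nevada"])
year = pd.Series([2000, 2001], index=["Ohio", "Nevada"])

# A dict of Series becomes a DataFrame: keys are column names, labels are rows.
frame = pd.DataFrame({"pop": population, "year": year})
print(frame)

# Each column comes back out as a Series that shares the frame's row index.
col = frame["year"]
print(type(col).__name__)
print(col.index.equals(frame.index))
```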

In [2]:
import pandas as pd
from pandas import DataFrame
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

In [14]:
print(frame)

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [15]:
# For large DataFrames, head() returns only the first five rows by default;
# pass a number to choose how many, e.g. frame.head(2) for the first two rows.
frame.head(2)

   state  year  pop
0   Ohio  2000  1.5
1   Ohio  2001  1.7


In [16]:
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


In [17]:
frame.tail(3)

    state  year  pop
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [18]:
pd.DataFrame(data, columns=['pop', 'state', 'year'])

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
5  3.2  Nevada  2003


In [6]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])
# 'debt' is not a key in data, so that column is filled with missing values (NaN);
# pd.isnull(frame2) would confirm this.
frame2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN


In [20]:
print(frame2.columns)
frame2['year']

Index(['year', 'state', 'pop', 'debt'], dtype='object')


one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [21]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [22]:
frame2.state

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [23]:
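frame2.pop  # 'pop' is also the name of a DataFrame method, so attribute access returns the method, not the column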

<bound method DataFrame.pop of        year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN>
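The output above shows a bound method rather than a column because `pop` is also a DataFrame method. A minimal sketch of the safer bracket syntax, using a cut-down two-row version of `frame2`:

```python
import pandas as pd

frame2 = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]},
                      index=["one", "two"])

# Attribute access finds the DataFrame.pop method, not the column.
print(callable(frame2.pop))

# Bracket access always returns the column, whatever its name.
pop_col = frame2["pop"]
print(pop_col)
```

Bracket access also works for column names that are not valid Python identifiers, such as names containing spaces.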

# The loc Attribute of a DataFrame

Rows can be retrieved by label with the special loc attribute (and by integer position with iloc).

In [24]:
print(frame2)
frame2.loc['three']

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN


year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [25]:
frame2.loc['six']

year       2003
state    Nevada
pop         3.2
debt        NaN
Name: six, dtype: object
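As a hedged sketch of the two row-selection styles, the example below uses a three-row stand-in for `frame2` and shows that a label lookup with `loc` and a position lookup with `iloc` can return the same row.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {"year": [2000, 2001, 2002], "state": ["Ohio", "Ohio", "Ohio"]},
    index=["one", "two", "three"],
)

# Row selected by its label.
by_label = frame2.loc["three"]

# The same row selected by its 0-based integer position.
by_position = frame2.iloc[2]
print(by_label.equals(by_position))

# A list of labels selects several rows and returns a DataFrame.
subset = frame2.loc[["one", "three"]]
print(subset)
```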

In [26]:
# Assigning the same value to every row of debt
frame2['debt'] = 16.5
frame2

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5


In [27]:
# Assigning different values to debt (the list length must match the number of rows)
frame2['debt'] = [16.5, 17.5, 18.5, 19.5, 20.5, 21.5]
frame2

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  17.5
three  2002    Ohio  3.6  18.5
four   2001  Nevada  2.4  19.5
five   2002  Nevada  2.9  20.5
six    2003  Nevada  3.2  21.5


In [28]:
# Assigning values to debt using NumPy's arange function
import numpy as np
frame2['debt'] = np.arange(6)
frame2

       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
six    2003  Nevada  3.2     5
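The cells above assign a scalar, a list, and an array to a column. One further case is worth a sketch: assigning a Series, which pandas aligns on the row labels rather than on position. The Series `val` and its values are hypothetical.

```python
import pandas as pd

frame2 = pd.DataFrame({"year": [2000, 2001, 2002]},
                      index=["one", "two", "three"])

# A Series is matched to rows by label; rows without a matching label get NaN.
val = pd.Series([-1.2, -1.5], index=["two", "three"])
frame2["debt"] = val
print(frame2)
```

Unlike a list or array, the Series does not need to have the same length as the DataFrame, because alignment fills the gaps with missing values.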


In [29]:
frame2.values

array([[2000, 'Ohio', 1.5, 0],
       [2001, 'Ohio', 1.7, 1],
       [2002, 'Ohio', 3.6, 2],
       [2001, 'Nevada', 2.4, 3],
       [2002, 'Nevada', 2.9, 4],
       [2003, 'Nevada', 3.2, 5]], dtype=object)

Index Objects: pandas's Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Index objects are immutable, so their entries cannot be modified in place.

In [30]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [31]:
obj

a    0
b    1
c    2
dtype: int64

In [32]:
index[1:]

Index(['b', 'c'], dtype='object')

In [33]:
index['a'] = 8 # TypeError

TypeError: Index does not support mutable operations
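Because individual entries of an Index cannot be changed, relabeling is done by assigning a whole new index. The following is a small sketch under that assumption; the replacement labels `x`, `y`, `z` are arbitrary.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

# Item assignment on an Index raises TypeError.
try:
    index[1] = "d"
except TypeError as exc:
    print("TypeError:", exc)

# Replacing the whole index is allowed; the old Index object is unchanged.
obj.index = ["x", "y", "z"]
print(obj)

# An Index also supports membership tests, much like a set.
print("a" in index)
```

Immutability makes it safe to share one Index object between several Series or DataFrames.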

When a Series or DataFrame is reindexed with labels that were not in its original index, the new entries are populated with NaN values. The missing values can be filled in with the fill_value parameter of reindex.
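The note about NaN values and `fill_value` describes the behavior of `reindex`. A minimal sketch, with a made-up Series (the values and labels are assumptions for illustration):

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3], index=["d", "b", "a"])

# Labels absent from the original index ('c' here) are populated with NaN.
reindexed = obj.reindex(["a", "b", "c", "d"])
print(reindexed)

# fill_value supplies a replacement for those new labels instead of NaN.
filled = obj.reindex(["a", "b", "c", "d"], fill_value=0)
print(filled)
```

`reindex` returns a new object arranged in the requested label order; `obj` keeps its original index.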

# Dropping Entries from an Axis

In [34]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj 

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [35]:
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [36]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [37]:
data.drop(['Colorado'])

          one  two  three  four
Ohio        0    1      2     3
Utah        8    9     10    11
New York   12   13     14    15


In [38]:
data.drop(['Colorado','Ohio', 'New York', 'Utah'])

Empty DataFrame
Columns: [one, two, three, four]
Index: []
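The `drop` calls above remove rows. As a sketch of the column case, the same method accepts `columns=` or `axis=1`; the variable names `no_two` and `same` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Drop a column by name with the columns keyword ...
no_two = data.drop(columns=["two"])

# ... or equivalently by passing axis=1.
same = data.drop("two", axis=1)
print(no_two)

# drop returns a new object, so the original still has all four columns.
print(list(data.columns))
```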


In [39]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [40]:
data[['three', 'one']]

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12


In [41]:
data[2:3] 

      one  two  three  four
Utah    8    9     10    11


In [42]:
data[data['three'] > 5]

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [43]:
data < 5

            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False


In [44]:
data[data < 5] = 0
data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


Selection with loc and iloc: The main distinction between loc and iloc is that loc is label-based, so you specify rows and columns by their row and column labels, while iloc is integer position-based, so you specify rows and columns by their 0-based integer positions.

In [46]:
data.loc['Colorado', ['two', 'four']]

two     5
four    7
Name: Colorado, dtype: int32

In [47]:
data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: Utah, dtype: int32

In [48]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

In [49]:
data.iloc[:, :3]

          one  two  three
Ohio        0    0      0
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14
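One practical difference between the two indexers shows up when slicing: a `loc` slice includes its end label, while an `iloc` slice excludes its end position, like ordinary Python slicing. A short sketch using the same 4x4 table (before the values below 5 were zeroed):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Label slices include both endpoints: rows Ohio..Utah, columns one..two.
by_label = data.loc["Ohio":"Utah", "one":"two"]
print(by_label)

# Position slices exclude the stop position: rows 0-1, columns 0-1.
by_position = data.iloc[0:2, 0:2]
print(by_position)
```

Here `by_label` has three rows and `by_position` has two, even though both slices start at the first row.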
