In [1]:
import pandas as pd

In [2]:
# Creating a Series from a list of elements
l = [1, 1, 2, 3, 5, 8, 13]

pd.Series(l)

0     1
1     1
2     2
3     3
4     5
5     8
6    13
dtype: int64

In [3]:
# Creating a Series from a dictionary
l = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "seven": 7}

pd.Series(l)

one      1
two      2
three    3
four     4
five     5
six      6
seven    7
dtype: int64

In [4]:
# Create a DataFrame from a list
data = [[1000, "Steve", 86.29],
        [1001, "Mathew", 91.63],
        [1002, "Jose", 72.90],
        [1003, "Patty", 69.23],
        [1004, "Vin", 88.30]]
pd.DataFrame(data, columns=["id_no", "name", "point"])

Unnamed: 0,id_no,name,point
0,1000,Steve,86.29
1,1001,Mathew,91.63
2,1002,Jose,72.9
3,1003,Patty,69.23
4,1004,Vin,88.3


In [5]:
# Create a DataFrame from a dictionary
data = {"Regd. No": [1000, 1001, 1002, 1003, 1004],
        "Names": ["Steve", "Mathew", "Jose", "Patty", "Vin"],
        "Marks%": [86.29, 91.63, 72.90, 69.23, 88.30]}
pd.DataFrame(data)

Unnamed: 0,Regd. No,Names,Marks%
0,1000,Steve,86.29
1,1001,Mathew,91.63
2,1002,Jose,72.9
3,1003,Patty,69.23
4,1004,Vin,88.3


### Pandas Objects
At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy arrays in which the rows and columns are identified with labels rather than simple integer indices.

In [6]:
import numpy as np
import pandas as pd

### The Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [7]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As we see in the output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes. The values are simply a familiar NumPy array:

In [8]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

The index is an array-like object of type pd.Index:

In [9]:
data.index

RangeIndex(start=0, stop=4, step=1)

As with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [10]:
data[1]

0.5

In [11]:
data[1:3]

1    0.50
2    0.75
dtype: float64

As we will see, though, the Pandas Series is much more general and flexible than the one-dimensional NumPy array.

### Series as generalized NumPy array

While a NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type:

In [12]:
# manipulating the index of our series
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [13]:
data["a":"d"]

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

We can even use non-contiguous or non-sequential indices:

In [14]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [15]:
data[5]

0.5

### Series as specialized dictionary
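A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values. We can make the analogy concrete by constructing a Series directly from a Python dictionary: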

In [16]:
population = {"california": 39332521,
              "Texas": 26448193,
              "New York": 19651127,
              "Florida": 19552860,
              "Illinois": 12882135}
population_series = pd.Series(population)
population_series

california    39332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

By default, a Series will be created where the index is drawn from the dictionary keys, in the order they were inserted (as the output above shows). From here, typical dictionary-style item access can be performed:

In [17]:
# Index the Series (not the original dict) by label
population_series["california"]

39332521
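
Beyond item access, a Series supports several other dictionary-style operations. The sketch below is only an illustration: `pop_demo` is a made-up, shortened stand-in for the population Series, and it shows membership tests, `keys()`, and `get()` with a default value.

```python
import pandas as pd

# pop_demo is a hypothetical, shortened version of the population Series
pop_demo = pd.Series({"Texas": 26448193, "Florida": 19552860})

# Membership tests look at the index labels, just as `in` looks at dict keys
has_texas = "Texas" in pop_demo
has_ohio = "Ohio" in pop_demo

# keys() returns the index; get() returns a default for a missing label
labels = list(pop_demo.keys())
ohio_pop = pop_demo.get("Ohio", 0)

print(has_texas, has_ohio, labels, ohio_pop)
```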

Unlike a dictionary, though, the Series also supports array-style operations such as slicing:

In [18]:
population_series["california": "Florida"]

california    39332521
Texas         26448193
New York      19651127
Florida       19552860
dtype: int64

### Pandas DataFrame Object
A DataFrame is an analog of a two-dimensional array with flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.

To demonstrate this, let's first construct a new Series listing the area of each of the five states discussed above:

In [19]:
area_dict = {"california": 423967, "Texas": 695662, "New York": 141297,
             "Florida": 170312, "Illinois": 149995}
area_series = pd.Series(area_dict)
area_series

california    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:

In [20]:
states = pd.DataFrame({"population": population_series, "area": area_series})
states

Unnamed: 0,population,area
california,39332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


Like the Series object, the DataFrame has an index attribute that gives access to the index labels:

In [21]:
states.index

Index(['california', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:

In [22]:
states.columns

Index(['population', 'area'], dtype='object')

Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and the columns have a generalized index for accessing the data.
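
To make the comparison with NumPy concrete, here is a minimal sketch that wraps a plain two-dimensional array in a DataFrame. The labels `"a"`, `"b"`, `"c"` and `"foo"`, `"bar"` are arbitrary choices for illustration and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# A plain 3x2 NumPy array with only implicit integer positions
arr = np.arange(6).reshape(3, 2)

# Wrapping it in a DataFrame attaches labels to both axes
# (the row and column labels here are arbitrary, illustrative choices)
frame = pd.DataFrame(arr, index=["a", "b", "c"], columns=["foo", "bar"])

# The same element, reached by position and by label
by_position = arr[1, 0]
by_label = frame.loc["b", "foo"]
print(by_position, by_label)
```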

### Pandas Index Object

This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set. It can be constructed from a list of integers:

In [23]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

In [24]:
ind[3]

7

In [25]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

In [26]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


One difference between Index objects and NumPy arrays is that indices are immutable; that is, they cannot be modified via normal means:

In [27]:
ind[1] = 0

TypeError: Index does not support mutable operations
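
Index objects also behave somewhat like sets. The following sketch, using two made-up integer indices `ind_a` and `ind_b`, shows the set-style methods `intersection`, `union`, and `symmetric_difference` that pandas provides on Index objects.

```python
import pandas as pd

# Two illustrative indices that share the labels 3, 5 and 7
ind_a = pd.Index([1, 3, 5, 7, 9])
ind_b = pd.Index([2, 3, 5, 7, 11])

common = ind_a.intersection(ind_b)            # labels present in both
combined = ind_a.union(ind_b)                 # labels present in either
only_one = ind_a.symmetric_difference(ind_b)  # labels present in exactly one

print(list(common), list(combined), list(only_one))
```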

### Data Indexing and Selection

#### Data Selection in Series

In [28]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=["a", "b", "c", "d"])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [29]:
data["e"] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

In [30]:
# slicing by explicit index
data["a":"c"]

a    0.25
b    0.50
c    0.75
dtype: float64

In [31]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [32]:
data["a":"e"]

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

In [33]:
# fancy indexing
data[["a", "e", "c"]]

a    0.25
e    1.25
c    0.75
dtype: float64

Notice that when slicing with an explicit index (e.g., data["a":"c"]), the final index is included in the slice, while when slicing with an implicit integer index (e.g., data[0:2]), the final index is excluded from the slice.
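
The difference in endpoint handling can be checked directly. This is a small sketch on a fresh Series `s` (a name chosen for this example), using the loc and iloc indexers to make the two kinds of slice explicit.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])

label_slice = s.loc["a":"c"]   # label-based: the endpoint "c" is included
position_slice = s.iloc[0:2]   # position-based: the endpoint 2 is excluded

print(len(label_slice), len(position_slice))
```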

##### Indexers: loc, iloc
The loc indexer selects by label (the explicit index), while iloc selects by integer position (the implicit, zero-based index).

In [34]:
data = pd.Series(["a", "b", "c"], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

First, the loc attribute allows indexing and slicing that always references the explicit index:

In [35]:
data.loc[1]

'a'

In [36]:
data.loc[1:3]

1    a
3    b
dtype: object

The iloc attribute allows indexing and slicing that always references the implicit Python-style index:

In [37]:
data.iloc[2]

'c'

In [38]:
data.iloc[1:3]

3    b
5    c
dtype: object

In [39]:
data = pd.Series(["a", "b", "c", "d", "e"], index=[1, 2, 3, 4, 5])

In [40]:
data.loc[1:3]

1    a
2    b
3    c
dtype: object

In [41]:
area_1 = pd.Series({"California": 423967, "Texas": 695662, "New York": 141297,
                    "Florida": 170312, "Illinois": 149995})
pop_1 = pd.Series({"California": 38332521, "Texas": 26448193,
                   "New York": 19651127, "Florida": 19552860,
                   "Illinois": 12882135})

data = pd.DataFrame({"area": area_1, "pop": pop_1})
data

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:

In [42]:
data["area"]

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Equivalently, we can use attribute-style access with column names that are strings:

In [43]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
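
Attribute-style access has a caveat: it only works when the column name does not collide with an existing DataFrame attribute or method. The sketch below uses a two-row stand-in called `frame` (a name introduced for this example) to show that `frame.pop` refers to the `DataFrame.pop` method, so the "pop" column has to be reached with brackets.

```python
import pandas as pd

frame = pd.DataFrame({"area": [423967, 695662], "pop": [38332521, 26448193]},
                     index=["California", "Texas"])

# "area" does not clash with anything, so both spellings give the same column
same_area = frame.area.equals(frame["area"])

# "pop" clashes with the DataFrame.pop method, so attribute access returns the method
pop_is_method = callable(frame.pop)
pop_column = frame["pop"]

print(same_area, pop_is_method, pop_column.tolist())
```

For this reason, bracket access is the safer default whenever a column name might shadow a method.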

As with the Series object discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:

In [45]:
data["density"] = data["pop"] / data["area"]
data

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


In [46]:
data["density"]

California     90.413926
Texas          38.018740
New York      139.076746
Florida       114.806121
Illinois       85.883763
Name: density, dtype: float64

We can also view the DataFrame as an enhanced two-dimensional array. We can examine its row labels with the index attribute and the raw underlying data array with the values attribute:

In [47]:
data.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [48]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

In [49]:
data

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


In [50]:
data.iloc[:, :]

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


In [51]:
# Use loc to select the California, Texas, and New York rows
# together with the area and pop columns

data.loc[["California", "Texas", "New York"],["area","pop"]]

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127


In [52]:
# This fails: data.pop is the DataFrame.pop method, not the "pop" column
data.loc[data.pop > 2000000, ["area", "pop", "density"]]

TypeError: '>' not supported between instances of 'method' and 'int'

In [53]:
data.pop

<bound method DataFrame.pop of               area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763>

In [54]:
data.loc[:"New York",:"pop"][data.loc[:"New York",:"pop"] > 200000]

Unnamed: 0,area,pop
California,423967.0,38332521
Texas,695662.0,26448193
New York,,19651127


In [56]:
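# Use loc with a boolean mask to select rows where pop is greater than 2000000
# (bracket access is needed because data.pop is a method)
data.loc[data["pop"] > 2000000, ["area", "pop", "density"]]

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763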

In [57]:
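# Combine two row conditions with &; the column is named "area", not "Area"
data.loc[(data["pop"] > 20000000) & (data["area"] > 200000)]

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874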

In [58]:
data[data.loc[:"New York",:"pop"] > 1500000]

Unnamed: 0,area,pop,density
California,,38332521.0,
Texas,,26448193.0,
New York,,19651127.0,
Florida,,,
Illinois,,,


In [59]:
data.iloc[2:, :2]

Unnamed: 0,area,pop
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


In [60]:
data.loc[:"Florida", :"pop"]

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860


In [45]:
data[data.density >100]

Unnamed: 0,area,pop,density
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
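
Masking and column selection can be combined in a single loc call. A minimal sketch, rebuilding a three-state version of the table under the made-up name `frame`:

```python
import pandas as pd

frame = pd.DataFrame(
    {"area": [423967, 695662, 141297],
     "pop": [38332521, 26448193, 19651127]},
    index=["California", "Texas", "New York"],
)
frame["density"] = frame["pop"] / frame["area"]

# Rows: boolean mask on density; columns: explicit list of labels
dense = frame.loc[frame["density"] > 100, ["pop", "density"]]
print(dense)
```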


### Handling Missing Data

The difference between data found in many tutorials and data in the real world is that real-world data is rarely clean. In particular, many interesting datasets will have some amount of data missing. To make matters even more complicated, the missing data may be spread across different columns.

#### NaN and None in Pandas
NaN and None both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:

In [61]:
import numpy as np
import pandas as pd

In [72]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64
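
The conversion above has two practical consequences that are easy to verify. In this short sketch (variable names are illustrative), the integers are upcast to float64 to make room for NaN, and pandas reductions skip NaN by default even though ordinary arithmetic with NaN yields NaN.

```python
import numpy as np
import pandas as pd

# None is converted to NaN, and the integers are upcast to float64
s = pd.Series([1, np.nan, 2, None])
dtype_name = str(s.dtype)

# Plain arithmetic with NaN yields NaN ...
plain = 1 + np.nan
# ... whereas pandas reductions skip NaN by default (skipna=True)
total = s.sum()

print(dtype_name, plain, total)
```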

### Operations on Null values

There are several useful methods for detecting, removing, and replacing null values in Pandas data structures. 

They are:
  - isnull() : Generate a boolean mask indicating missing values
  - notnull() : Opposite of isnull()
  - dropna() : Return a filtered version of the data
  - fillna() : Return a copy of data with missing values filled or imputed
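
As a quick orientation, the four methods can be tried side by side on one small Series. This is only a sketch with made-up values; it also shows a common idiom, summing the boolean mask to count missing entries.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, None])

n_missing = int(s.isnull().sum())    # True counts as 1 when summed
n_present = int(s.notnull().sum())
dropped = s.dropna()                 # new Series without the missing entries
filled = s.fillna(0.0)               # new Series with missing entries set to 0.0

print(n_missing, n_present, dropped.tolist(), filled.tolist())
```

None of these calls modifies `s` itself; each returns a new object.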

In [84]:
data = pd.Series([1, np.nan, "hello", None])

In [76]:
data

0        1
1      NaN
2    hello
3     None
dtype: object

In [80]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [78]:
data[data.notnull()]

0        1
2    hello
dtype: object

### Dropping null values

#### Method to remove NA values

In [85]:
# The dropna method drops the missing data
data.dropna()

0        1
2    hello
dtype: object

In [83]:
# dropna() returns a new Series; data itself is unchanged
data

0        1
1      NaN
2    hello
3     None
dtype: object

For a DataFrame, there are more options. Consider the following DataFrame:

In [87]:
df = pd.DataFrame([[1, np.nan, 2],
                  [2, 3, 5],
                  [np.nan, 4, 6]])

df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


We cannot drop single values from a DataFrame; we can only drop full rows or full columns. Depending on the application, you might want one or the other, so dropna() gives a number of options for a DataFrame. By default, dropna() will drop all rows in which any null value is present:

In [91]:
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [89]:
df.dropna()

Unnamed: 0,0,1,2
1,2.0,3.0,5


In [10]:
df.dropna(axis=0)

Unnamed: 0,0,1,2
1,2.0,3.0,5


Alternatively, you can drop NA values along a different axis; axis=1 drops all columns containing a null value:

In [92]:
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [11]:
df.dropna(axis=1)

Unnamed: 0,2
0,2
1,5
2,6


But this drops some good data as well; you might rather be interested in dropping rows or columns with all NA values, or a majority of NA values. This can be specified through the "how" or "thresh" parameters, which allow fine control of the number of nulls to allow.
The default is how="any", such that any row or column (depending on the axis keyword) containing a null value will be dropped. You can also specify how="all", which will only drop rows/columns that are all null values:

In [93]:
df[3] = np.nan
df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,


In [13]:
df.dropna(axis=1, how="all")

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [94]:
df.dropna(axis=1, how="any")

Unnamed: 0,2
0,2
1,5
2,6


In [95]:
df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,
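
The "thresh" parameter mentioned earlier sets a minimum number of non-null values that a row or column must have in order to be kept. A minimal sketch, rebuilding the same table under the name `frame` (chosen for this example) so that the block is self-contained:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1, np.nan, 2],
                      [2, 3, 5],
                      [np.nan, 4, 6]])
frame[3] = np.nan  # extra column that is entirely null

# Keep only rows with at least 3 non-null values
kept = frame.dropna(axis=0, thresh=3)
print(kept)
```

Only the middle row has three non-null entries, so it is the only one that survives.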


### Filling null values
Sometimes, rather than dropping NA values, you'd rather replace them with a valid value. This value might be a single number or might be some sort of imputation or interpolation from the good values. You could do this in-place using the isnull() method as a mask, but because it is such a common operation Pandas provides the fillna() method, which returns a copy of the array with the null values replaced.

Consider the following Series:

In [96]:
data = pd.Series([1, np.nan, 2, None, 3], index=list("abcde"))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [97]:
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
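
Filling with a constant is the simplest case; a basic form of imputation is to fill with a statistic of the observed values. The sketch below, which is only one possible choice, fills the gaps with the mean of the non-missing entries.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 2, None, 3], index=list("abcde"))

# mean() skips NaN, so this is the mean of the observed values 1, 2 and 3
filled = s.fillna(s.mean())
print(filled)
```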

We can specify a forward-fill to propagate the previous value forward:

In [107]:
data = pd.Series([4, np.nan, 2, 5,  None, 3, None], index=list("abcdefg"))

In [110]:
data.fillna(method="ffill")

a    4.0
b    4.0
c    2.0
d    5.0
e    5.0
f    3.0
g    3.0
dtype: float64

Or we can specify a back-fill to propagate the next values backward:

In [111]:
data

a    4.0
b    NaN
c    2.0
d    5.0
e    NaN
f    3.0
g    NaN
dtype: float64

In [112]:
# back-fill
data.fillna(method="bfill")

a    4.0
b    2.0
c    2.0
d    5.0
e    3.0
f    3.0
g    NaN
dtype: float64
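
Recent pandas releases deprecate the method= argument of fillna() in favor of the dedicated ffill() and bfill() methods. Here is a minimal sketch of the equivalent calls, assuming a pandas version that provides these methods; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, np.nan, 2, 5, None, 3, None], index=list("abcdefg"))

forward = s.ffill()   # propagate the previous valid value forward
backward = s.bfill()  # propagate the next valid value backward

print(forward.tolist())
print(backward.tolist())
```

The last entry of the back-filled result stays NaN because no later value exists to fill it from.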

For a DataFrame, the options are similar, but we can also specify an axis along which the fills take place:

In [113]:
df

Unnamed: 0,0,1,2,3
0,1.0,,2,
1,2.0,3.0,5,
2,,4.0,6,


In [114]:
df.fillna(method="ffill", axis=1)

Unnamed: 0,0,1,2,3
0,1.0,1.0,2.0,2.0
1,2.0,3.0,5.0,5.0
2,,4.0,6.0,6.0


In [115]:
df.fillna(method="bfill", axis=1)

Unnamed: 0,0,1,2,3
0,1.0,2.0,2.0,
1,2.0,3.0,5.0,
2,4.0,4.0,6.0,


Notice that if a previous value is not available during a forward fill, the NA value remains.

### Pandas String Operations

In [29]:
data = ["peter", "Paul", "MARY", "gUIDO"]
[s.capitalize() for s in data]

['Peter', 'Paul', 'Mary', 'Guido']

This is perhaps sufficient to work with some data, but it will break if there are any missing values.
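
To see the failure concretely, the sketch below adds a None entry to a hypothetical list `raw` and catches the resulting error, then contrasts it with the vectorized .str accessor on a Series.

```python
import pandas as pd

raw = ["peter", "Paul", None, "MARY", "gUIDO"]  # hypothetical data with one missing entry

try:
    [s.capitalize() for s in raw]
    failed = False
except AttributeError:
    # None has no capitalize() method, so the comprehension stops here
    failed = True

# The vectorized .str accessor leaves the missing entry missing instead
capitalized = pd.Series(raw).str.capitalize()
print(failed, capitalized.isnull().tolist())
```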

In [28]:
# List comprehension
[i**2 for i in range(10)]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [30]:
names = pd.Series(data)
names

0    peter
1     Paul
2     MARY
3    gUIDO
dtype: object

We can now call a single method that will capitalize all the entries, while skipping over any missing values:

In [32]:
names.str.capitalize()

0    Peter
1     Paul
2     Mary
3    Guido
dtype: object

### String Methods
Here are some of Pandas str methods that mirror Python string methods:
  - len(), lower(), translate(), islower(), ljust(), upper(), startswith(), isupper(), rjust(), find(), endswith(), isnumeric()

In [33]:
monte = pd.Series(["Graham Chapman", "John Cleese", "Terry Gilliam",
                  "Eric Idle", "Terry Jones", "Micheal Palin"])

In [36]:
monte.str.lower()

0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     micheal palin
dtype: object

In [37]:
monte.str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

In [39]:
monte.str.startswith("T")

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool

In [40]:
var = monte.str.split()

In [26]:
var

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Micheal, Palin]
dtype: object
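
Because split() returns a list for every entry, the .str accessor can be applied a second time to pick items out of those lists. A small sketch, using a shortened copy of the names under the illustrative variable `names`, that extracts each last name with str.get():

```python
import pandas as pd

names = pd.Series(["Graham Chapman", "John Cleese", "Eric Idle"])

# split() yields a list per entry; .str.get(-1) picks the last item of each list
last_names = names.str.split().str.get(-1)
print(last_names.tolist())
```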

### Concat and Append
Here we'll take a look at simple concatenation of Series and DataFrame objects with the pd.concat function:

In [42]:
ser1 = pd.Series(["A", "B", "C"], index=[1,2,3])
ser2 = pd.Series(["D", "E", "F"], index=[4,5,6])
pd.concat([ser1, ser2])

1    A
2    B
3    C
4    D
5    E
6    F
dtype: object

In [94]:
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args =args
    
    def _repr_html_(self):
        return "\n".join(self.template.format(a, eval(a)._repr_html_())
                        for a in self.args)
    def __repr__(self):
        return "\n\n".join(a + "\n" + repr(eval(a))
                          for a in self.args)

In [95]:
def make_df(cola, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
           for c in cola}
    return pd.DataFrame(data, ind)
# example DataFrame
make_df("ABC", range(3))

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2


It also works to concatenate higher-dimensional objects, such as DataFrames:

In [97]:
df1 = make_df("AB", [1, 2])
df2 = make_df("AB", [3, 4])
display("df1", "df2", "pd.concat([df1, df2])")

Unnamed: 0,A,B
1,A1,B1
2,A2,B2

Unnamed: 0,A,B
3,A3,B3
4,A4,B4

Unnamed: 0,A,B
1,A1,B1
2,A2,B2
3,A3,B3
4,A4,B4


By default, the concatenation takes place row-wise within the DataFrame (i.e., axis=0). Like np.concatenate, pd.concat allows specification of an axis along which concatenation will take place:

In [98]:
df3 = make_df("AB", [0, 1])
df4 = make_df("CD", [0, 1])
display("df3", "df4", "pd.concat([df3, df4], axis=1)")

Unnamed: 0,A,B
0,A0,B0
1,A1,B1

Unnamed: 0,C,D
0,C0,D0
1,C1,D1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
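
One point worth knowing about pd.concat is that it preserves index labels, even when that produces duplicates. The sketch below uses two small frames `x` and `y` (names made up for this example) that share the labels 0 and 1, and shows two ways of dealing with the overlap: the verify_integrity flag, which raises an error, and the ignore_index flag, which builds a fresh integer index.

```python
import pandas as pd

x = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
y = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]}, index=[0, 1])

# Default behaviour: the repeated labels 0 and 1 are kept as-is
stacked = pd.concat([x, y])

# verify_integrity=True turns overlapping labels into an error
try:
    pd.concat([x, y], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True

# ignore_index=True discards the old labels and builds a fresh 0..n-1 index
renumbered = pd.concat([x, y], ignore_index=True)

print(stacked.index.tolist(), overlap_detected, renumbered.index.tolist())
```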
