# PANDAS

**INTRODUCTION**  
**Pandas is an open-source Python library that provides easy-to-use data structures for data analysis.**
**Pandas is great for data manipulation and data analysis, and it supports basic data visualization through plotting methods built on Matplotlib.**

#### WHY PANDAS?

1. We can easily read from and write to CSV files, or even databases.
2. Easy handling of missing data (represented as NaN) in floating-point as well as non-floating-point data.
3. We can manipulate the data by columns: columns can be inserted into and deleted from a DataFrame and higher-dimensional objects.
4. Intuitive merging and joining of data sets.
5. Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.
6. Robust IO tools for loading data from flat files (CSV and delimited), Excel files, and databases, and for saving data back to them.
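
As a minimal sketch of the first two points, the snippet below writes a small table to CSV text and reads it back. The column names and values are made up for illustration, and an in-memory buffer stands in for a real file on disk.

```python
import io

import pandas as pd

# Hypothetical sample data; None marks a missing age.
people = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [31, None, 25]})

# Write to CSV text (the buffer plays the role of a file).
buffer = io.StringIO()
people.to_csv(buffer, index=False)

# Rewind and read the same text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

# The missing age comes back as NaN.
print(restored)
print(restored["age"].isna().sum())  # number of missing ages
```

With a real file, the buffer would be replaced by a path such as a CSV file name.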


In [2]:
import pandas as pd
import numpy as np
%matplotlib inline

# Series
A series is a 1-D data structure. It is basically a labelled array that can hold different data types:
* int
* float
* String
* Python object
* many more

The data is aligned in a row fashion, with one index label per value.
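
To illustrate the list of data types, here is a small sketch that builds one series per kind of value and prints the dtype pandas chose. The variable names and values are illustrative only, and the exact dtype names can differ between pandas versions and platforms.

```python
import pandas as pd

# Each Series picks a dtype based on the values it holds.
ints = pd.Series([1, 2, 3])
floats = pd.Series([1.5, 2.5, 3.5])
strings = pd.Series(["a", "b", "c"])
mixed = pd.Series([1, "two", 3.0])  # arbitrary Python objects in one series

for name, s in [("ints", ints), ("floats", floats), ("strings", strings), ("mixed", mixed)]:
    print(name, s.dtype)
```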

### Creating a series from random data

In [3]:
# Generate random data for the series
data_for_series = np.random.randint(0, 100, size=(10))
print(data_for_series)

[28 98 98 56  4 12 54 56 76  9]


In [4]:
series = pd.Series(data_for_series)
series

0    28
1    98
2    98
3    56
4     4
5    12
6    54
7    56
8    76
9     9
dtype: int64

#### Information about the series we created

In [5]:
series.describe()

count    10.000000
mean     49.100000
std      35.101282
min       4.000000
25%      16.000000
50%      55.000000
75%      71.000000
max      98.000000
dtype: float64

### Accessing the data

Using **head()**, we can view the data from the top of the series. If no argument is passed, **head()** returns the first 5 rows. Similarly, **tail()** returns rows from the end.

In [6]:
series.head()

0    28
1    98
2    98
3    56
4     4
dtype: int64

In [7]:
series.head(3)

0    28
1    98
2    98
dtype: int64

In [8]:
series.tail(4)

6    54
7    56
8    76
9     9
dtype: int64

### Creating a series with our custom index

In [9]:
index_for_series = 'A B C D E F G H I J'.split()
index_for_series

['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']

In [10]:
series2 = pd.Series(data_for_series, index=index_for_series)
series2

A    28
B    98
C    98
D    56
E     4
F    12
G    54
H    56
I    76
J     9
dtype: int64

### Using **loc** and **iloc**
* **loc** gets the rows by their ***label*** in the index.
* **iloc** gets the rows by their integer ***position*** in the index. (Note: iloc accepts integer positions, such as a single integer, a list of integers, or an integer slice, but not labels.)
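
The difference is easiest to see when the labels are themselves integers that do not match the positions. A minimal sketch, using a made-up series that is not part of this notebook:

```python
import pandas as pd

# Hypothetical series whose integer labels differ from the positions.
s = pd.Series([100, 200, 300], index=[10, 20, 30])

by_label = s.loc[10]     # label 10 is the first element
by_position = s.iloc[0]  # position 0 is also the first element

print(by_label, by_position)
```

Here `s.loc[0]` would fail, because 0 is a position and not one of the labels.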

In [11]:
series2.loc['C']

98

In [12]:
series2.iloc[2]

98

In [13]:
series2.loc[['C','D']]

C    98
D    56
dtype: int64

In [14]:
series2.iloc[2:4]

C    98
D    56
dtype: int64

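#### Note:
We get an error if we try either of the following, because **loc** expects a label and **iloc** expects an integer position:

```python
series2.loc[1]     # fails: 1 is not a label in the index
series2.iloc['A']  # fails: iloc does not accept labels
```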
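
The two failures can be observed safely with try/except. This is a sketch that uses a shortened copy of the series; the exact exception type raised may vary between pandas versions, so several are caught.

```python
import pandas as pd

# Shortened copy of the series, for illustration.
series2 = pd.Series([28, 98, 98], index=["A", "B", "C"])

errors = []
for description, lookup in [
    ("loc with a position", lambda: series2.loc[1]),
    ("iloc with a label", lambda: series2.iloc["A"]),
]:
    try:
        lookup()
    except (KeyError, TypeError, ValueError, IndexError) as exc:
        # Record which kind of error each lookup produced.
        errors.append((description, type(exc).__name__))

print(errors)
```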

### Slicing the series

In [15]:
series2[2:5]

C    98
D    56
E     4
dtype: int64
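
Slices can also be written explicitly with **iloc** or **loc**. The sketch below, using the first five values of the series, shows one difference worth knowing: a positional slice excludes its end, while a label slice includes it.

```python
import pandas as pd

# First five values of the series above, rebuilt so the snippet runs on its own.
series2 = pd.Series([28, 98, 98, 56, 4], index=list("ABCDE"))

positional = series2.iloc[2:4]   # positions 2 and 3; the end position is excluded
labelled = series2.loc["C":"D"]  # labels C through D; the end label is included

print(positional.equals(labelled))
```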

### Creating series from dictionary

In [16]:
data_dict = {
    'A':1,
    'B':100,
    'C':12,
    'D':14,
    'E':155,
    'F':22,
    'G':123,
    'H':21,
    'I':51,
    'J':74,
}

In [17]:
series3 = pd.Series(data_dict)
series3

A      1
B    100
C     12
D     14
E    155
F     22
G    123
H     21
I     51
J     74
dtype: int64
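
When an index is passed together with a dictionary, pandas looks each label up among the dictionary keys. A small sketch with a shortened dictionary; the label 'Z' is a deliberately missing key chosen for illustration:

```python
import pandas as pd

# Shortened version of the dictionary above.
data_dict = {"A": 1, "B": 100, "C": 12}

# Labels are matched against the keys; "Z" has no entry, so it becomes NaN.
subset = pd.Series(data_dict, index=["C", "A", "Z"])
print(subset)
```

This gives control over both the order of the entries and which keys are kept.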

## DATAFRAME

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. A pandas DataFrame can be created using the following constructor:

```python
pandas.DataFrame(data, index, columns, dtype)
```
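
As a quick sketch of how the four parameters fit together, the call below passes each one by keyword. The values and labels are illustrative only.

```python
import pandas as pd

frame = pd.DataFrame(
    data=[[1, 2], [3, 4]],  # the values, row by row
    index=["r1", "r2"],     # row labels
    columns=["x", "y"],     # column labels
    dtype=float,            # force every column to floating point
)
print(frame)
print(frame.dtypes)
```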

### Creating dataframe from a list of lists

In [50]:
data = [['Harry', 15], ['John', 14], ['Andrew', 13]]
df = pd.DataFrame(data, columns=['Name','Age'])
df

|   | Name | Age |
|---|------|-----|
| 0 | Harry | 15 |
| 1 | John | 14 |
| 2 | Andrew | 13 |
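
The cell above builds the frame from a list of rows. A dictionary that maps column names to lists of values should give the same table; a minimal sketch, where the variable names are chosen for illustration:

```python
import pandas as pd

# Each key becomes a column name; each list holds that column's values.
data_by_column = {"Name": ["Harry", "John", "Andrew"], "Age": [15, 14, 13]}
df_from_dict = pd.DataFrame(data_by_column)
print(df_from_dict)
```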


### Creating dataframe from random numbers

In [51]:
data = np.random.randint(0, 10, (5, 4))  # integers from 0 to 9 (upper bound is exclusive) in a 5x4 array
print(data)

[[7 4 3 9]
 [8 8 9 5]
 [9 1 0 4]
 [8 0 9 5]
 [6 4 5 0]]


In [52]:
my_index = '1 2 3 4 5'.split()
print(my_index)

['1', '2', '3', '4', '5']


In [53]:
df = pd.DataFrame(data,index=my_index,columns='A B C D'.split())
df

|   | A | B | C | D |
|---|---|---|---|---|
| 1 | 7 | 4 | 3 | 9 |
| 2 | 8 | 8 | 9 | 5 |
| 3 | 9 | 1 | 0 | 4 |
| 4 | 8 | 0 | 9 | 5 |
| 5 | 6 | 4 | 5 | 0 |


### Checking the information

In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 1 to 5
Data columns (total 4 columns):
A    5 non-null int64
B    5 non-null int64
C    5 non-null int64
D    5 non-null int64
dtypes: int64(4)
memory usage: 200.0+ bytes


In [55]:
df.describe()

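|   | A | B | C | D |
|---|---|---|---|---|
| count | 5.0 | 5.0 | 5.0 | 5.0 |
| mean | 7.6 | 3.4 | 5.2 | 4.6 |
| std | 1.140175 | 3.130495 | 3.898718 | 3.209361 |
| min | 6.0 | 0.0 | 0.0 | 0.0 |
| 25% | 7.0 | 1.0 | 3.0 | 4.0 |
| 50% | 8.0 | 4.0 | 5.0 | 5.0 |
| 75% | 8.0 | 4.0 | 9.0 | 5.0 |
| max | 9.0 | 8.0 | 9.0 | 9.0 |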


In [56]:
type(df)

pandas.core.frame.DataFrame

### Indexing columns

In [57]:
df[['B','D']]  # a single column can also be selected with df['B'] or df.B

|   | B | D |
|---|---|---|
| 1 | 4 | 9 |
| 2 | 8 | 5 |
| 3 | 1 | 4 |
| 4 | 0 | 5 |
| 5 | 4 | 0 |
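
The kind of brackets used determines the type of the result. A short sketch, rebuilding the dataframe with the values shown above so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Same values as the dataframe above.
values = np.array([[7, 4, 3, 9], [8, 8, 9, 5], [9, 1, 0, 4], [8, 0, 9, 5], [6, 4, 5, 0]])
df = pd.DataFrame(values, index=list("12345"), columns=list("ABCD"))

one_column = df["B"]        # single label: the result is a Series
one_col_frame = df[["B"]]   # list of labels: the result is a DataFrame with one column

print(type(one_column).__name__, type(one_col_frame).__name__)
```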


### Using iloc and loc
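Here we compare **loc** (label-based) with **iloc** (position-based) selection. Positions start at 0 while our row labels start at '1', so **df.loc['1']** returns the first row and **df.iloc[1]** returns the second. Likewise, column 'B' is at position 1, so **df.iloc[:, [0]]** returns column 'A'. Only the last two cells (In [63] and In [64]) select exactly the same rows and columns.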

In [58]:
df

|   | A | B | C | D |
|---|---|---|---|---|
| 1 | 7 | 4 | 3 | 9 |
| 2 | 8 | 8 | 9 | 5 |
| 3 | 9 | 1 | 0 | 4 |
| 4 | 8 | 0 | 9 | 5 |
| 5 | 6 | 4 | 5 | 0 |


In [59]:
df.loc['1']

A    7
B    4
C    3
D    9
Name: 1, dtype: int64

In [60]:
df.iloc[1]

A    8
B    8
C    9
D    5
Name: 2, dtype: int64

In [61]:
df.loc[:, ['B']]

|   | B |
|---|---|
| 1 | 4 |
| 2 | 8 |
| 3 | 1 |
| 4 | 0 |
| 5 | 4 |


In [62]:
df.iloc[:, [0]]

|   | A |
|---|---|
| 1 | 7 |
| 2 | 8 |
| 3 | 9 |
| 4 | 8 |
| 5 | 6 |


In [63]:
df.loc[['2','4','5'], ['B','D']]

|   | B | D |
|---|---|---|
| 2 | 8 | 5 |
| 4 | 0 | 5 |
| 5 | 4 | 0 |


In [64]:
df.iloc[[1, 3, 4],[1, 3]]

|   | B | D |
|---|---|---|
| 2 | 8 | 5 |
| 4 | 0 | 5 |
| 5 | 4 | 0 |
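
Whether a **loc** call and an **iloc** call really select the same data can be checked with **equals()**. The sketch below rebuilds the dataframe from the values shown above and pairs each label-based selection with its positional counterpart.

```python
import numpy as np
import pandas as pd

# Same values as the dataframe above.
values = np.array([[7, 4, 3, 9], [8, 8, 9, 5], [9, 1, 0, 4], [8, 0, 9, 5], [6, 4, 5, 0]])
df = pd.DataFrame(values, index=list("12345"), columns=list("ABCD"))

# The row labelled '1' sits at position 0; column 'B' sits at position 1.
same_row = df.loc["1"].equals(df.iloc[0])
same_column = df.loc[:, ["B"]].equals(df.iloc[:, [1]])
same_block = df.loc[["2", "4", "5"], ["B", "D"]].equals(df.iloc[[1, 3, 4], [1, 3]])

print(same_row, same_column, same_block)
```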


### Combine two columns

In [65]:
df['Sum'] = df['A'] + df['B']
df

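|   | A | B | C | D | Sum |
|---|---|---|---|---|-----|
| 1 | 7 | 4 | 3 | 9 | 11 |
| 2 | 8 | 8 | 9 | 5 | 16 |
| 3 | 9 | 1 | 0 | 4 | 10 |
| 4 | 8 | 0 | 9 | 5 | 8 |
| 5 | 6 | 4 | 5 | 0 | 10 |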
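
An alternative sketch for the same result sums across a list of columns with **sum(axis=1)** and attaches the result with **assign()**, which returns a new dataframe instead of modifying the existing one. The names `total` and `with_sum` are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Same values as the dataframe above.
values = np.array([[7, 4, 3, 9], [8, 8, 9, 5], [9, 1, 0, 4], [8, 0, 9, 5], [6, 4, 5, 0]])
df = pd.DataFrame(values, index=list("12345"), columns=list("ABCD"))

# Row-wise sum over a list of columns; this scales to more than two columns.
total = df[["A", "B"]].sum(axis=1)

# assign() returns a new dataframe and leaves df untouched.
with_sum = df.assign(Sum=total)
print(with_sum)
```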


### Dropping a column

In [66]:
df.drop('Sum',axis=1) #Column drop axis=1

|   | A | B | C | D |
|---|---|---|---|---|
| 1 | 7 | 4 | 3 | 9 |
| 2 | 8 | 8 | 9 | 5 |
| 3 | 9 | 1 | 0 | 4 |
| 4 | 8 | 0 | 9 | 5 |
| 5 | 6 | 4 | 5 | 0 |


In [67]:
df #No changes in actual dataframe

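|   | A | B | C | D | Sum |
|---|---|---|---|---|-----|
| 1 | 7 | 4 | 3 | 9 | 11 |
| 2 | 8 | 8 | 9 | 5 | 16 |
| 3 | 9 | 1 | 0 | 4 | 10 |
| 4 | 8 | 0 | 9 | 5 | 8 |
| 5 | 6 | 4 | 5 | 0 | 10 |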


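#### Note:
We can see that the column was not dropped from the dataframe: **drop()** returns a new dataframe and leaves the original unchanged. To keep the change, we can either assign the result back to the variable or set the ***inplace*** flag so that the change is made in the dataframe itself.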
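
A minimal sketch of the reassignment approach, using a cut-down two-row version of the dataframe for illustration:

```python
import pandas as pd

# Cut-down version of the dataframe above.
df = pd.DataFrame({"A": [7, 8], "B": [4, 8], "Sum": [11, 16]}, index=["1", "2"])

# drop() returns a new dataframe, so rebinding the name keeps the change.
df = df.drop(columns="Sum")
print(df.columns.tolist())
```

The `columns=` keyword is an alternative to passing `axis=1`.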

In [68]:
df.drop('Sum', axis=1, inplace=True)

In [69]:
df

|   | A | B | C | D |
|---|---|---|---|---|
| 1 | 7 | 4 | 3 | 9 |
| 2 | 8 | 8 | 9 | 5 |
| 3 | 9 | 1 | 0 | 4 |
| 4 | 8 | 0 | 9 | 5 |
| 5 | 6 | 4 | 5 | 0 |


In [70]:
df.drop('1', axis=0, inplace=True)

In [71]:
df

|   | A | B | C | D |
|---|---|---|---|---|
| 2 | 8 | 8 | 9 | 5 |
| 3 | 9 | 1 | 0 | 4 |
| 4 | 8 | 0 | 9 | 5 |
| 5 | 6 | 4 | 5 | 0 |


In [72]:
df>0

|   | A | B | C | D |
|---|---|---|---|---|
| 2 | True | True | True | True |
| 3 | True | True | False | True |
| 4 | True | False | True | True |
| 5 | True | True | True | False |


In [74]:
df[df>5] = 'Changed'
df

|   | A | B | C | D |
|---|---|---|---|---|
| 2 | Changed | Changed | Changed | 5 |
| 3 | Changed | 1 | 0 | 4 |
| 4 | Changed | 0 | Changed | 5 |
| 5 | Changed | 4 | 5 | 0 |
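
The assignment above overwrites the dataframe and mixes strings into integer columns. When the original values should be kept, **mask()** is a non-destructive alternative. This is a sketch, with the replacement value -1 chosen arbitrarily so that the columns stay numeric:

```python
import numpy as np
import pandas as pd

# Values of the dataframe as it stood before the assignment above.
values = np.array([[8, 8, 9, 5], [9, 1, 0, 4], [8, 0, 9, 5], [6, 4, 5, 0]])
df = pd.DataFrame(values, index=list("2345"), columns=list("ABCD"))

# mask() replaces values where the condition is True and returns a new dataframe.
flagged = df.mask(df > 5, other=-1)

print(flagged)
print(df.loc["2", "A"])  # the original dataframe is unchanged
```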
