<img src="https://drive.google.com/uc?id=1p2lP14-7YdSEb7bzIPrFGjvQyDvlhtqt" align = "right">


## **Importing the package**

In [1]:
import numpy as np  
import pandas as pd  

# np and pd are called 'aliases'

## **Types of data structures in pandas**




## 1. Series

### A Series can be seen as a one-dimensional labeled array. It can hold any data type, including strings, integers, floats and Python objects.

In [2]:
# Series with strings as datatype
s = pd.Series(['a','b','c'])

# Series with integers as datatype
i = pd.Series([1,2,3,4,5])

# Series built from a dictionary: the keys become the index labels
d = pd.Series({ 'India': 100, 'US': 80, 'Canada': 60, 'France': 40,'UK': 20})

In [3]:
# printing the first element
print(s[0])

###
print("-"*200)
print("-"*200)
###

# printing an array of elements
print(i[:4])

a
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
0    1
1    2
2    3
3    4
dtype: int64


## **Hey! Series are actually numpy arrays then, right?**

In [5]:
# Index labels do not have to be integers
s_with_index = pd.Series([1, 7, 2], index=["x", "y", "z"])

print(s_with_index['x'])

# Use a separate loop variable so the Series `i` defined earlier is not overwritten
for label in ['x','y','z']:
  print(s_with_index[label])

1
1
7
2


In [6]:
d = pd.Series({ 'India': 100, 'US': 80, 'Canada': 60, 'France': 40,'UK': 20})
print(d)

###
print('-'*200)
print('-'*200)
###

# NumPy has no notion of labels: the dict is stored as a single Python object
a = np.array({'a':1, 'b':2})
print(a)

India     100
US         80
Canada     60
France     40
UK         20
dtype: int64
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
{'a': 1, 'b': 2}


In [13]:
d_arr = d.values
# print(type(d_arr))

### A Series uses a NumPy array as its underlying data structure but provides a wide range of functionality beyond the scope of NumPy arrays
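
One piece of that extra functionality is label alignment. The sketch below uses made-up data (the names `prices` and `stock` are only for illustration) to show that the values can be pulled out as a NumPy array, and that arithmetic between two Series matches entries by index label rather than by position.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen only for illustration.
prices = pd.Series({"apple": 3.0, "banana": 1.0, "cherry": 5.0})
stock = pd.Series({"cherry": 10, "apple": 2})

# The underlying values are available as a NumPy array.
values = prices.to_numpy()
print(type(values))

# Arithmetic aligns on index labels, not on position.
# Labels present in only one Series ("banana") give NaN.
total = prices * stock
print(total)
```

A plain NumPy array would multiply element by element in positional order and would fail here because the two arrays have different lengths.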

## 2. DataFrame

In [30]:
data = {
  "calories": [510, 490, 240],
  "fat": [5, 4, 6]
}

df = pd.DataFrame(data)
df
# print(df)

|   | calories | fat |
|---|---|---|
| 0 | 510 | 5 |
| 1 | 490 | 4 |
| 2 | 240 | 6 |


The index labels can be changed in the same way as for a Series




In [31]:
df = pd.DataFrame(data, index=['x','y','z'])
df

|   | calories | fat |
|---|---|---|
| x | 510 | 5 |
| y | 490 | 4 |
| z | 240 | 6 |


In [None]:
df.calories['x']

In [15]:
cal_1 = df.calories
cal_2 = df['calories']

# print(type(cal_1),type(cal_2))
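
As a small sketch of the two access styles above, the example below rebuilds the same `df` and checks that both return the same Series. It also shows `loc` with a row label and a column name, which selects a single cell in one step.

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [510, 490, 240], "fat": [5, 4, 6]},
    index=["x", "y", "z"],
)

by_attribute = df.calories   # attribute access
by_key = df["calories"]      # key access, works for any column name
print(type(by_key))

# Select one cell by row label and column name in a single step.
cell = df.loc["x", "calories"]
print(cell)
```

Attribute access is convenient, but it does not work for column names that contain spaces or that clash with an existing DataFrame method, so key access is the safer habit.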

## 3. Functions in Pandas

### Let's import a dataset using a pandas function

The read_csv function reads a dataset from a CSV file. Similar functions exist for other file formats, such as read_excel for Excel files.

In [32]:
data = pd.read_csv('bacteria_train.csv')

# Writing is done with a DataFrame method, not a top-level pandas function:
# data.to_csv("")  # fill in the output path

read_csv returns the data as a DataFrame
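
The sketch below shows the same idea without needing a file on disk. The CSV text and its column names are invented for illustration, and `io.StringIO` stands in for the file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "height,weight\n1.70,65\n1.82,80\n1.65,58\n"

frame = pd.read_csv(io.StringIO(csv_text))
print(type(frame))
print(frame)

# to_csv is a DataFrame method; with no path it returns the CSV text.
# index=False leaves out the row labels.
round_trip = frame.to_csv(index=False)
print(round_trip)
```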

In [33]:
data

|   | Perc_population | Spreading_factor |
|---|---|---|
| 0 | 1.535 | 0.190708 |
| 1 | 5.555 | 0.326928 |
| 2 | -0.277 | -0.459699 |
| 3 | 1.724 | -0.193013 |
| 4 | -0.550 | -0.835745 |
| ... | ... | ... |
| 418 | -0.461 | -1.536030 |
| 419 | 2.179 | 1.445470 |
| 420 | 6.328 | 1.107800 |
| 421 | 3.854 | 0.841114 |


Each row of the DataFrame above is called a sample (if the data is a subset of a larger dataset) or a dataset instance (if you consider the whole population).

In machine learning terminology, the columns are called features of the dataset. 

In [28]:
data.columns

Index(['Perc_population', 'Spreading_factor'], dtype='object')

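The rename function can be used to rename the columns. It returns a new DataFrame, so the original keeps its column names unless the result is assigned back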

In [None]:
data.rename(columns = {"Perc_population": "perc_pop"})
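
A minimal sketch of that behaviour, using a small made-up frame with the same column names as the dataset:

```python
import pandas as pd

# Hypothetical values; only the column names follow the dataset above.
frame = pd.DataFrame({"Perc_population": [1.5, 5.5], "Spreading_factor": [0.2, 0.3]})

renamed = frame.rename(columns={"Perc_population": "perc_pop"})

print(list(frame.columns))    # the original frame is unchanged
print(list(renamed.columns))  # the returned frame carries the new name
```

To keep the new name, assign the result back, for example `frame = frame.rename(...)`.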

The head and tail functions show the first and last rows of the dataset (five rows by default)

In [18]:
data.head(5)

|   | Perc_population | Spreading_factor |
|---|---|---|
| 0 | 1.535 | 0.190708 |
| 1 | 5.555 | 0.326928 |
| 2 | -0.277 | -0.459699 |
| 3 | 1.724 | -0.193013 |
| 4 | -0.55 | -0.835745 |


In [19]:
data.tail(5)

|   | Perc_population | Spreading_factor |
|---|---|---|
| 418 | -0.461 | -1.53603 |
| 419 | 2.179 | 1.44547 |
| 420 | 6.328 | 1.1078 |
| 421 | 3.854 | 0.841114 |
| 422 | 0.987 | -1.87563 |


The shape attribute gives the dimensions of the dataset as (number of rows, number of columns)

In [20]:
data.shape

(423, 2)

The info function summarizes the structure of the dataset: the index range, the column names, the non-null count and dtype of each column, and the memory usage

In [21]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423 entries, 0 to 422
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Perc_population   423 non-null    float64
 1   Spreading_factor  423 non-null    float64
dtypes: float64(2)
memory usage: 6.7 KB


The isna function returns a boolean value for each data point in the dataset: True where the value is missing (NaN) and False otherwise

NaN, short for "not a number", is a special floating-point value used to represent a result that is undefined or unrepresentable. For example, 0/0 is undefined as a real number and is, therefore, represented by NaN. pandas also uses NaN to mark missing entries.


In [22]:
data.isna()

|   | Perc_population | Spreading_factor |
|---|---|---|
| 0 | False | False |
| 1 | False | False |
| 2 | False | False |
| 3 | False | False |
| 4 | False | False |
| ... | ... | ... |
| 418 | False | False |
| 419 | False | False |
| 420 | False | False |
| 421 | False | False |
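
The bacteria dataset has no missing values, so every entry above is False. The sketch below uses a small made-up frame that does contain NaN to show what isna picks up, and how summing the result counts the missing entries per column.

```python
import numpy as np
import pandas as pd

# NaN is a float value and is not equal to itself.
print(type(np.nan))
print(np.nan == np.nan)

# Hypothetical frame with two missing entries.
frame = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

# isna() marks the missing entries; summing counts them per column.
missing_per_column = frame.isna().sum()
print(missing_per_column)
```

Because NaN never compares equal to anything, a test like `frame == np.nan` finds nothing, which is why isna is the right tool here.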


The describe function reports summary statistics for each numeric column, such as count, mean, standard deviation, minimum and maximum

The 25%, 50% and 75% values are called the quartiles of a dataset. Quartiles are the three cut points that divide the sorted values into four parts of equal size. The 50% value is the median.

In [23]:
data.describe()

|   | Perc_population | Spreading_factor |
|---|---|---|
| count | 423.0 | 423.0 |
| mean | 2.266849 | 0.012332 |
| std | 2.537024 | 0.993033 |
| min | -0.998 | -1.90824 |
| 25% | 0.1195 | -0.832867 |
| 50% | 1.44 | 0.142743 |
| 75% | 4.1515 | 0.841114 |
| max | 7.558 | 1.90402 |
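
The quartiles can also be computed directly with the quantile function. The sketch below uses eight made-up values. By default pandas interpolates linearly between neighbouring values, so a quartile does not have to be one of the data points.

```python
import pandas as pd

# Hypothetical sample of eight values.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

# quantile() interpolates linearly between neighbouring values by default.
quartiles = values.quantile([0.25, 0.5, 0.75])
print(quartiles)

# describe() reports the same three numbers under the labels 25%, 50% and 75%.
summary = values.describe()
print(summary[["25%", "50%", "75%"]])
```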


The nunique function reports the number of unique values in each column of the dataset.

In machine learning terminology, it tells us how many different values a feature takes across all the samples of the dataset

In [25]:
data.nunique()

Perc_population     412
Spreading_factor    360
dtype: int64

The unique function returns all the unique values that a feature takes

In [None]:
data.Perc_population.unique()

The value_counts function returns the number of occurrences of each unique value in a column, most frequent first

In [27]:
data.Perc_population.value_counts()

-0.878    3
 1.047    2
 6.025    2
 6.726    2
 1.008    2
         ..
 0.323    1
-0.998    1
 5.564    1
-0.236    1
 0.987    1
Name: Perc_population, Length: 412, dtype: int64
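
The three functions are easy to mix up, so here is a side-by-side sketch on a small made-up column (the name `colours` is only for illustration):

```python
import pandas as pd

# Hypothetical column with repeated values.
colours = pd.Series(["red", "blue", "red", "green", "red", "blue"])

n_distinct = colours.nunique()     # how many distinct values there are
distinct = colours.unique()        # the distinct values, in order of first appearance
counts = colours.value_counts()    # occurrences of each value, most frequent first

print(n_distinct)
print(distinct)
print(counts)
```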

Arbitrary functions can be applied along the axes of a DataFrame using the apply() method.

In [37]:
data.apply(np.mean)

# data.apply(np.mean, axis=1)

Perc_population     2.266849
Spreading_factor    0.012332
dtype: float64

Using lambda functions, it is even possible to apply your own customised operations. Here the lambda receives each column in turn and returns its range

In [39]:
data.apply(lambda x: x.max() - x.min())

Perc_population     8.55600
Spreading_factor    3.81226
dtype: float64
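
The axis argument decides what the function receives. As a sketch on a small made-up frame: with the default `axis=0` the function is called once per column, and with `axis=1` it is called once per row.

```python
import pandas as pd

# Hypothetical 3x2 frame.
frame = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Default axis=0: the function receives each column as a Series.
column_means = frame.apply(lambda col: col.mean())

# axis=1: the function receives each row as a Series.
row_ranges = frame.apply(lambda row: row.max() - row.min(), axis=1)

print(column_means)
print(row_ranges)
```

The first result has one entry per column, the second has one entry per row.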

In [41]:
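# Drop a column by name. Without columns= (or axis=1), drop looks for a row label and raises a KeyError
data.drop(columns='Perc_population')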
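
The drop function removes rows by default, so the target has to be stated when a column is meant. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame.
frame = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

without_a = frame.drop(columns="a")    # drop a column by name
without_row_0 = frame.drop(index=0)    # drop a row by index label

print(list(without_a.columns))
print(list(without_row_0.index))
print(frame.shape)  # the original frame is unchanged
```

Like rename, drop returns a new DataFrame, which is why the later cells can still use every column of `data`.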

The iloc indexer takes row and column positions in square brackets and gives you the corresponding subset of the DataFrame. Positions are integers and the end of a slice is excluded

In [43]:
data.iloc[:10,1:]

The loc indexer does a similar job to iloc, but it selects by label: we specify exactly which row index labels and which column names we want in our subset

In [49]:
data.loc[[300],['Perc_population','Spreading_factor']]

|   | Perc_population | Spreading_factor |
|---|---|---|
| 300 | 0.907 | -0.051036 |
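
With the default integer index, row labels and row positions look the same, which hides the difference between the two indexers. The sketch below uses the small calories frame with string labels to make it visible.

```python
import pandas as pd

frame = pd.DataFrame(
    {"calories": [510, 490, 240], "fat": [5, 4, 6]},
    index=["x", "y", "z"],
)

# iloc: integer positions, the end of a slice is excluded.
by_position = frame.iloc[0:2, 0:1]

# loc: labels, the end of a slice is included.
by_label = frame.loc["x":"y", ["calories"]]

print(by_position)
print(by_label)
```

Both selections return rows x and y of the calories column.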


In [50]:
data.dtypes

Perc_population     float64
Spreading_factor    float64
dtype: object

The insert function inserts a column at the specified position. Unlike rename and replace, it modifies the DataFrame in place

In [73]:
random_col = np.random.randint(100, size=len(data))

data.insert(2, 'random_col', random_col)
data

|   | Perc_population | Spreading_factor | random_col |
|---|---|---|---|
| 0 | 1.535 | 0.190708 | 76 |
| 1 | 5.555 | 0.326928 | 55 |
| 2 | -0.277 | -0.459699 | 30 |
| 3 | 1.724 | -0.193013 | 15 |
| 4 | -0.550 | -0.835745 | 50 |
| ... | ... | ... | ... |
| 418 | -0.461 | -1.536030 | 75 |
| 419 | 2.179 | 1.445470 | 12 |
| 420 | 6.328 | 1.107800 | 63 |
| 421 | 3.854 | 0.841114 | 20 |
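
Two details of insert are worth seeing on a small made-up frame: it returns None because the frame itself is changed, and by default it refuses to insert a column name that already exists. This also means that running the cell above a second time raises an error.

```python
import pandas as pd

# Hypothetical frame.
frame = pd.DataFrame({"a": [1, 2], "c": [5, 6]})

# insert() changes the frame in place and returns None.
result = frame.insert(1, "b", [3, 4])
print(result)
print(list(frame.columns))

# Inserting a column name that already exists raises ValueError by default.
try:
    frame.insert(0, "a", [0, 0])
    raised = False
except ValueError as error:
    raised = True
    print("ValueError:", error)
```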


The sample function draws rows at random from the dataset, here 100 of them. Rows are drawn without replacement by default, so no row appears twice

In [75]:
data.sample(100)

|   | Perc_population | Spreading_factor | random_col |
|---|---|---|---|
| 322 | 6.999 | 0.965823 | 79 |
| 148 | -0.040 | -0.363768 | 40 |
| 284 | 6.168 | 1.612390 | 69 |
| 100 | 1.877 | 0.466986 | 66 |
| 125 | -0.878 | -1.687600 | 21 |
| ... | ... | ... | ... |
| 212 | 5.865 | 1.105880 | 64 |
| 79 | 2.233 | 1.044490 | 11 |
| 274 | 5.705 | 1.180710 | 61 |
| 235 | -0.824 | -0.356094 | 23 |
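
Each call to sample gives a different result. When the same draw is needed again, for example to reproduce an experiment, the `random_state` argument fixes the seed. The sketch below uses a small made-up frame, and also shows the `frac` argument for drawing a fraction of the rows instead of a fixed number.

```python
import pandas as pd

# Hypothetical frame with ten rows.
frame = pd.DataFrame({"value": range(10)})

# The same seed gives the same rows.
first = frame.sample(3, random_state=0)
second = frame.sample(3, random_state=0)
print(first)

# frac draws a fraction of the rows, here half of them.
half = frame.sample(frac=0.5, random_state=0)
print(len(half))
```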


In [82]:
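# Replace every occurrence of 1.535 with 1.260; the result is a new DataFrame
data.replace(1.535, 1.260)

# A dictionary maps several old values to new values in one call
#data.replace({1.260 : 1.120, 5.555 : 0.435, 1.724 : 2.675})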
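
A minimal sketch of both forms of replace on a small made-up frame. The values are chosen only for illustration.

```python
import pandas as pd

# Hypothetical frame.
frame = pd.DataFrame({"a": [1.5, 2.0, 1.5], "b": [2.0, 3.0, 4.0]})

# Single value: every 1.5 becomes 9.9, in all columns.
single = frame.replace(1.5, 9.9)

# Dictionary: each key is replaced by its value.
mapped = frame.replace({1.5: 0.0, 2.0: -1.0})

print(single)
print(mapped)
print(frame)  # the original frame is unchanged
```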

# **Advantages**

* ###    Fast and efficient for manipulating and analyzing data.
* ###    Data can be loaded from many different file formats, such as CSV and Excel.

* ###  Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data

* ###  Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects

* ###  Provides database-style operations such as queries, merging and joining two tables, and flexible reshaping and pivoting of data.
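
As a small illustration of the last point, the sketch below joins two made-up tables on a shared key column and then filters the result with a query expression. The table and column names are invented for the example.

```python
import pandas as pd

# Hypothetical tables.
people = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "city_id": [1, 2, 1]})
cities = pd.DataFrame({"city_id": [1, 2], "city": ["Oslo", "Lima"]})

# Database-style left join on the shared key column.
joined = people.merge(cities, on="city_id", how="left")

# Filter rows with a query expression.
in_oslo = joined.query("city == 'Oslo'")

print(joined)
print(in_oslo["name"].tolist())
```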




