# Pandas



Pandas is one of the most popular Python libraries for data science, and with good reason: it offers powerful, expressive and flexible data structures that make data manipulation and analysis easy.

What’s cool about Pandas is that it takes data (from a CSV file, an Excel file or a SQL database) and creates a Python object with rows and columns, called a *DataFrame*, that looks very similar to a table in statistical software or a spreadsheet (think Excel).

As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.


    

### Installation and getting started

In [None]:
import numpy as np
import pandas as pd 

### Pandas Data Structures

At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.

#### The Pandas Series Object

<br>
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects). The axis labels are collectively called the *index*.



In [None]:
#creating a series from ndarray
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)

We did not pass an index, so Pandas assigned a default one ranging from 0 to `len(data) - 1`, i.e., 0 to 3.

In [None]:
#creating a series and passing the index values
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)

In [None]:
#creating a series from a dictionary
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

**Observe**: the dictionary keys are used to construct the index.

In [None]:
s = pd.Series([3, -5, 7, 4],  index=['a',  'b',  'c',  'd'])
print(s)

### Accessing Data from Series with Position
Data in a Series can be accessed by position, much like the elements of an ndarray.

In [None]:
#retrieve the first element
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
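#s[0] on a label-indexed Series is deprecated in recent pandas versions; .iloc is the explicit positional accessor
print(s.iloc[0])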

In [None]:
#retrieve the first 3 elements
print(s.iloc[:3])

### Retrieve Data Using Label (Index)
A Series is like a fixed-size dict in that you can get and set values by index label.
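
The dict analogy can be illustrated with a minimal sketch. The variable names below are illustrative only, and `Series.get` is used as the counterpart of `dict.get`.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# membership tests check the index labels, as with dict keys
has_a = 'a' in s
has_z = 'z' in s

# set a value through its label
s['b'] = 20

# .get returns a default instead of raising KeyError for a missing label
missing = s.get('z', -1)

print(has_a, has_z, s['b'], missing)
```

Unlike a dict, the Series keeps its labels in order and still supports vectorized arithmetic on the values.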

In [None]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])

#retrieve a single element
print(s['a'])

In [None]:
#retrieve multiple elements using a list of index label values
print(s[['a','c','d']])

#### The Pandas DataFrame Object

<br>
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. <br>
In general, a Pandas DataFrame consists of three main components: the data, the index, and the columns.



A DataFrame can be built from data that is:
- a Pandas `DataFrame`
- a Pandas `Series`
- a NumPy `ndarray`
- a dictionary of one-dimensional `ndarray`s, lists, dictionaries or `Series`
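
As a rough sketch of two of these inputs, the example below builds one DataFrame from a 2-D `ndarray` and another from a dictionary of `Series`. All names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# from a 2-D ndarray: the column labels are supplied explicitly
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(arr, columns=['x', 'y'])

# from a dictionary of Series: rows are aligned on the index labels,
# and labels missing from one Series produce NaN in that column
ages = pd.Series([28, 34], index=['tom', 'jack'])
cities = pd.Series(['Ghent', 'Pune', 'Recife'], index=['tom', 'jack', 'steve'])
df_from_series = pd.DataFrame({'Age': ages, 'City': cities})

print(df_from_array)
print(df_from_series)
```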

Note the difference between `np.ndarray` and `np.array()`. The former is an actual data type, while the latter is a function that builds arrays from other data structures.
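
A quick way to see the distinction, as a small sketch:

```python
import numpy as np

# np.array() is a function that builds an array from a list
arr = np.array([1, 2, 3])

# the object it returns is an instance of the np.ndarray type
print(type(arr))
print(isinstance(arr, np.ndarray))

# np.array is callable but is not itself a type; np.ndarray is a type
print(callable(np.array), isinstance(np.ndarray, type))
```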


In [None]:
#create a dataframe from lists
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
df

In [None]:
#create a dataframe from a dictionary of lists/ndarrays
#all the arrays must be of the same length
data = {'Country': ['Belgium',  'India',  'Brazil'],
        'Capital': ['Brussels',  'New Delhi',  'Brasilia'],
        'Population': [11190846, 1303171035, 207847528]}

df = pd.DataFrame(data, columns=['Country',  'Capital',  'Population'])
df

In [None]:
#create a dataframe from a dictionary and pass an index
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
df

### Loading and Saving Data with Pandas

Usually, when we use Pandas for data analysis, we’ll use it in one of three different ways:
- Convert a Python list, dictionary or NumPy array to a Pandas DataFrame
- Open a local file using Pandas, usually a CSV or Excel file
- Open a remote file or database, such as a CSV or JSON file on a website through a URL, or read from a SQL table/database
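
The CSV and Excel cases are covered in this section. For the database case, a minimal sketch could look like the following. It assumes an in-memory SQLite database standing in for a real server, and the table name `people` and its contents are hypothetical.

```python
import sqlite3
import pandas as pd

df_people = pd.DataFrame({'Name': ['Tom', 'Jack'], 'Age': [28, 34]})

# an in-memory SQLite database stands in for a real database server
con = sqlite3.connect(':memory:')

# write the frame to a table, without the index column
df_people.to_sql('people', con, index=False)

# read the result of a query back into a DataFrame
df_sql = pd.read_sql_query('SELECT Name, Age FROM people WHERE Age > 30', con)
con.close()

print(df_sql)
```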




**Read and Write to CSV**


In [None]:
#write to csv file
df.to_csv('csv_example.csv')

#read from csv file
df_csv = pd.read_csv('csv_example.csv')
df_csv

We can see that the index now appears twice. The `Unnamed: 0` column is the old index that `to_csv` wrote into the file, while the leftmost index is generated automatically by Pandas when it loads the CSV file.

This problem can be avoided by not writing the index to the CSV file, because `read_csv` will generate a new one anyway. To do so, pass `index=False` to the `to_csv(...)` function.
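
When the index carries meaning and should survive the round trip, another option is to keep it in the file and tell `read_csv` which column holds it. The sketch below uses an in-memory buffer instead of a file on disk, and the frame `df_ranked` is made up for illustration.

```python
import io
import pandas as pd

df_ranked = pd.DataFrame({'Name': ['Tom', 'Jack'], 'Age': [28, 34]},
                         index=['rank1', 'rank2'])

# to_csv() without a path returns the CSV text; the index is written as the first column
csv_text = df_ranked.to_csv()

# index_col=0 tells read_csv to use that first column as the index again
df_back = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_back)
```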


In [None]:
#write
df.to_csv('csv_example.csv', index=False)

#read
df_csv = pd.read_csv('csv_example.csv')
df_csv

**Read and Write to Excel**

In [None]:
#write to excel
df.to_excel('excel-example.xlsx', index=False, sheet_name='Sheet1')

#read from excel
df_excel = pd.read_excel('excel-example.xlsx')
df_excel

### Selection of Data

In [None]:
data = {'Country': ['Belgium',  'India',  'Brazil'],
        'Capital': ['Brussels',  'New Delhi',  'Brasilia'],
        'Population': [11190846, 1303171035, 207847528]}

df = pd.DataFrame(data, columns=['Country',  'Capital',  'Population'])

In [None]:
#get subset of a DataFrame
df[1:] #starting from row with index 1 and all columns


In [None]:
#select the first row 
df.iloc[0]

By Position <br>





In [None]:
#select the first element of the first column by row and column position
#list indexers return a 1x1 DataFrame; df.iloc[0, 0] would return the bare value
df.iloc[[0],[0]]

In [None]:
#select multiple values by row and column position
df.iloc[1:3,[1]] #rows 1 and 2 and just the column at position 1 -> Capital

In [None]:
df.iloc[[0,2],[1]] #rows at positions 0 and 2 and the column at position 1 -> Capital

By Label

In [None]:
#Select single value by row and column labels
df.loc[1, 'Country']

In [None]:
#Select multiple values by row and column labels
df.loc[0:1,['Capital', 'Country']]
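
One difference between the two accessors is easy to miss: a slice in `iloc` excludes its end position, while a slice in `loc` includes its end label. A small sketch on a made-up frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                        'Capital': ['Brussels', 'New Delhi', 'Brasilia']})

by_position = df_demo.iloc[0:1]   # end position 1 is excluded
by_label = df_demo.loc[0:1]       # end label 1 is included

print(len(by_position), len(by_label))
```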

### Filter, Sort and Groupby


In [None]:
#you can use different conditions to filter rows
df[df['Population']<12000000]

In [None]:
#sort values in a column in ascending order using df.sort_values(col1)
df.sort_values('Country')

In [None]:
#or in descending order
df.sort_values('Country', ascending=False)

In [None]:
#also possible to sort ascending by one column and descending by another one
df.sort_values(['Country','Population'],ascending=[True,False])
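
Grouping is named in the heading but has no cell of its own, so here is a minimal sketch of `groupby`. The data is hypothetical: the three-country frame above has no repeated keys to group on.

```python
import pandas as pd

# hypothetical data, not taken from the cells above
df_sales = pd.DataFrame({'Region': ['North', 'South', 'North', 'South'],
                         'Units': [10, 4, 6, 9]})

# split the rows by Region, then sum the Units within each group
units_per_region = df_sales.groupby('Region')['Units'].sum()

print(units_per_region)
```

The result is a Series indexed by the group labels, so it can be filtered and sorted like any other Series.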

### Viewing and Inspecting Data
**Basic information**

In [None]:
#get the first 2 rows
df.head(2)

In [None]:
#get the last 2 rows
df.tail(2)

In [None]:
#get the number of rows and columns
print(df.shape)

#list the column labels
print(df.columns)

#get the index, datatype and memory information
df.info()

#Number of non-NA values
df.count()

In [None]:
#returns the actual data in the dataframe as an ndarray
df.values

**Summary information**
<br>
To get statistics on an entire DataFrame or a Series we can use:
- `df.mean()` Returns the mean of each numeric column
- `df.corr()` Returns the pairwise correlation between numeric columns
- `df.count()` Returns the number of non-null values in each column
- `df.max()` Returns the highest value in each column
- `df.min()` Returns the lowest value in each column
- `df.median()` Returns the median of each numeric column
- `df.std()` Returns the standard deviation of each numeric column

When the DataFrame also holds text columns, recent pandas versions expect `numeric_only=True` for `mean`, `median`, `std` and `corr`.
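
A short sketch of these calls on a frame that mixes text and numbers. The frame `df_stats` is made up, and `numeric_only=True` is passed so that the text column is skipped.

```python
import pandas as pd

df_stats = pd.DataFrame({'Name': ['Tom', 'James', 'Ricky'],
                         'Age': [25, 26, 30],
                         'Rating': [4.0, 3.0, 5.0]})

# column-wise mean over the numeric columns only
means = df_stats.mean(numeric_only=True)

# pairwise correlation between the numeric columns
correlation = df_stats.corr(numeric_only=True)

print(means)
print(correlation)
```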

In [None]:
#creating a dataframe to try out some functions
data = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.78,2.98,4.80,4.10,3.65])
}
df = pd.DataFrame(data)
df

In [None]:
df.sum()

In [None]:
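#numeric_only=True skips the text column 'Name', which recent pandas versions refuse to correlate
df.corr(numeric_only=True)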

In [None]:
df.dropna()

In [None]:
df.max()

###  Data Cleaning

Data cleaning is a very important step in data analysis. The most common commands are:
- `df.isnull()` checks for null values and returns a boolean DataFrame
- `df.notnull()` is the opposite of `df.isnull()` and also returns a boolean DataFrame
- `df.dropna()` drops rows that contain missing values
- `df.dropna(axis=1)` drops columns that contain missing values
- `df.fillna(x)` fills the missing values with x
- `df.replace(old, new)` replaces all occurrences of the old value with the new one
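
These commands are often combined. The sketch below counts the missing values per column and then fills the gaps in one column with that column's mean. This is only one possible strategy, and the frame `df_dirty` is made up for illustration.

```python
import numpy as np
import pandas as pd

df_dirty = pd.DataFrame({'Age': [25, 26, 30],
                         'Rating': [4.0, np.nan, 5.0]})

# True counts as 1, so summing the boolean frame counts missing values per column
missing_per_column = df_dirty.isnull().sum()

# fill the gap in Rating with the mean of the observed ratings;
# fillna returns a new frame and leaves df_dirty unchanged
df_filled = df_dirty.fillna({'Rating': df_dirty['Rating'].mean()})

print(missing_per_column)
print(df_filled)
```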

In [None]:
df.isnull()

In [None]:
df.notnull()

In [None]:
df.dropna(axis=1)

In [None]:
df.fillna(1.2)

In [None]:
df.replace(25,555)

### Join/Combine

The last set of basic Pandas commands joins or combines data frames by rows or columns. The three commands are:
- `pd.concat([df1, df2])`: adds the rows of df2 to the end of df1 (columns should be identical). Older tutorials use `df1.append(df2)` for this, but `DataFrame.append` was removed in pandas 2.0.
- `pd.concat([df1, df2], axis=1)`: adds the columns of df2 to the right of df1 (rows should be identical)
- `df1.merge(df2, on=col1, how='inner')`: SQL-style join of df1 and df2 on the rows where `col1` has identical values. The `how` argument can be one of 'left', 'right', 'outer', 'inner'.
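
The effect of `how` only shows when the key columns differ. A minimal sketch with two made-up frames that share just some of their names:

```python
import pandas as pd

left = pd.DataFrame({'Name': ['Tom', 'James', 'Ricky'], 'Age': [25, 26, 25]})
right = pd.DataFrame({'Name': ['Tom', 'Ricky', 'Vin'],
                      'Occupation': ['Student', 'Student', 'Professional']})

inner = left.merge(right, on='Name', how='inner')      # names present in both frames
left_join = left.merge(right, on='Name', how='left')   # every row of the left frame
outer = left.merge(right, on='Name', how='outer')      # names from either frame

print(len(inner), len(left_join), len(outer))
```

Rows without a match on the other side are kept by 'left' and 'outer' and get NaN in the columns that come from the missing side.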


In [None]:
#creating first dataframe
data1 = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.78,2.98,4.80,4.10,3.65])
}
df1 = pd.DataFrame(data1)

#creating second dataframe
data2 = {'Name':pd.Series(['Andreea','Vlad','Carlo']),
   'Age':pd.Series([21,21,21]),
   'Rating':pd.Series([4.,5,4.6])
}
df2 = pd.DataFrame(data2)

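#appending rows: DataFrame.append was removed in pandas 2.0,
#so pd.concat with the default axis=0 is used to stack the rows of df2 under df1
df3 = pd.concat([df1, df2])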
df3

In [None]:
#creating first dataframe
data1 = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve']),
   'Age':pd.Series([25,26,25,23,30]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20])
}
df1 = pd.DataFrame(data1)

#creating second dataframe
data2 = {'Occupation':pd.Series(['Student','Professional','Student','Student',
                                'Professional'])
}
df2 = pd.DataFrame(data2)

#using concat
df3 = pd.concat([df1, df2],axis=1)
df3

In [None]:
#creating first dataframe
data1 = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve']),
   'Age':pd.Series([25,26,25,23,30]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20])
}
df1 = pd.DataFrame(data1)

#creating second dataframe
data2 = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve']),
        'Occupation':pd.Series(['Student','Professional','Student','Student',
                                'Professional'])
}
df2 = pd.DataFrame(data2)

#using merge
df3 = df1.merge(df2,on='Name',how='left') 
df3