# Creating, Reading and Writing Data

# Introduction to Pandas

**What is Pandas?**

1. Pandas is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool built on top of the Python programming language.
2. Pandas is an open source library made mainly for working with relational or labeled data easily and intuitively.

### Why is it needed?

1. It provides data structures and operations for manipulating numerical, categorical, and time series data.
2. It is used for data cleaning and analysis.
3. It can read and write many file formats, such as Excel, CSV, JSON, TXT, and Parquet.
4. It can handle missing values.
5. It can create pivot tables.
6. It supports filtering, sorting, grouping, and many other kinds of queries.
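
The points above can be sketched in a few lines. This is only an illustrative example: the `Region`, `Item`, and `Units` columns and their values are made up for the demonstration and do not come from any dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
df = pd.DataFrame({
    'Region': ['East', 'West', 'East', 'West'],
    'Item': ['Pen', 'Pen', 'Pencil', 'Pencil'],
    'Units': [10, np.nan, 5, 8],
})

df['Units'] = df['Units'].fillna(0)                  # handle missing values
big = df[df['Units'] > 5]                            # filtering
ordered = df.sort_values('Units', ascending=False)   # sorting
totals = df.groupby('Region')['Units'].sum()         # grouping
pivot = df.pivot_table(index='Region', columns='Item',
                       values='Units', aggfunc='sum')  # pivot table

print(totals)
print(pivot)
```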

### How to Install Pandas

1. Method 1
   In the command prompt (CMD), run the following:
   pip install pandas

2. Method 2
   In a Jupyter notebook cell, run the following:
   !pip install pandas
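
After either method, a quick way to confirm that the installation worked is to import the library and print its version. This is a minimal check; the version printed depends on your environment.

```python
import pandas as pd

# If the import succeeds, pandas is installed; the version string varies by setup.
print(pd.__version__)
```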

# Data Structures In Pandas

*In pandas we have 2 fundamental data structures*

1)<b>Series</b><br>
  a) It consists of a single column of data with one or more rows. A Series is indexed.<br>
  b) A Series can have multiple index levels as well.<br>

2)<b>DataFrame</b><br>
  a) It generally consists of multiple columns of data with one or more rows. A DataFrame is indexed.<br>
  b) A DataFrame can have multiple index levels as well.<br>
  c) A DataFrame can also consist of just a single column.<br>
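
The sketch below illustrates these points with small made-up values: a plain Series, a Series with two index levels (a `MultiIndex`), and a DataFrame that has only one column. The names `marks`, `sales`, and `one_col` are illustrative and are not used elsewhere in this notebook.

```python
import pandas as pd

# A Series: one column of values plus an index.
marks = pd.Series([78, 91, 84], index=['Ankit', 'Shikha', 'Pankaj'], name='Scores')

# A Series with two index levels (hypothetical year/quarter labels).
levels = pd.MultiIndex.from_tuples(
    [('2023', 'Q1'), ('2023', 'Q2'), ('2024', 'Q1')],
    names=['year', 'quarter'],
)
sales = pd.Series([100, 120, 130], index=levels)

# A single-column DataFrame is still a DataFrame, not a Series.
one_col = marks.to_frame()

print(type(marks).__name__)      # Series
print(sales.index.nlevels)       # 2
print(type(one_col).__name__)    # DataFrame
print(one_col.shape)             # (3, 1)
```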

# Importing Pandas library

In [15]:
import pandas as pd
# pandas is aliased as pd
import numpy as np
# numpy is aliased as np

## Creating a Series

1) By default, a Series index starts from 0.

In [17]:
s1 =  pd.Series([11,13,14,15,17])
print(type(s1))
s1

<class 'pandas.core.series.Series'>


0    11
1    13
2    14
3    15
4    17
dtype: int64

In [21]:
### Fetch series index and values

print(s1.index)              # RangeIndex(start=0, stop=5, step=1)
print(s1.index.tolist())     # [0, 1, 2, 3, 4]
print(s1.values)             # [11 13 14 15 17]
print(type(s1.values))       # numpy array

RangeIndex(start=0, stop=5, step=1)
[0, 1, 2, 3, 4]
[11 13 14 15 17]
<class 'numpy.ndarray'>


### Fetch element present at a particular index from a series.

In [23]:
print(s1)
print(s1[0],s1[2])
print(s1[1],s1[4])

0    11
1    13
2    14
3    15
4    17
dtype: int64
11 14
13 17


### Creating a series with custom index.

In [29]:
s2 = pd.Series([34,35,36,37,38],index=list(range(71,76)))
s2

71    34
72    35
73    36
74    37
75    38
dtype: int64

In [31]:
print(s2.index.tolist())
print(s2.values)

[71, 72, 73, 74, 75]
[34 35 36 37 38]


In [33]:
### Fetch element present at a particular index from s2
print(s2[71],s2[74])
print(s2[73],s2[75])

34 37
36 38
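
With a custom integer index, square brackets look values up by label, so `s2[0]` would raise a `KeyError` here because 0 is not one of the labels. A minimal sketch of the explicit alternatives: `.loc` selects by label and `.iloc` selects by position.

```python
import pandas as pd

s2 = pd.Series([34, 35, 36, 37, 38], index=list(range(71, 76)))

by_label = s2.loc[71]       # label 71
by_position = s2.iloc[0]    # first position
last = s2.iloc[-1]          # last position

print(by_label, by_position, last)   # 34 34 38
```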


### Creating a series from a list, a tuple, or a numpy array

In [35]:
a = [7,9,12,15,18]
w1 = pd.Series(a)
print(type(w1))
print(w1)

<class 'pandas.core.series.Series'>
0     7
1     9
2    12
3    15
4    18
dtype: int64


In [39]:
c =(12,16,18,19,24) # type(c) = tuple
w3 = pd.Series(c)
print(type(w3))
w3

<class 'pandas.core.series.Series'>


0    12
1    16
2    18
3    19
4    24
dtype: int64
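
A numpy array works the same way as a list or a tuple. The cell below is a sketch that follows the naming pattern of the cells above; the names `b` and `w2` and the values are assumptions, not taken from the original notebook.

```python
import numpy as np
import pandas as pd

b = np.array([5, 10, 15, 20, 25])   # type(b) = numpy.ndarray
w2 = pd.Series(b)

print(type(w2))
print(w2.tolist())   # [5, 10, 15, 20, 25]
```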

### A series cannot be created using a set

In [43]:
c1 = {12,15,18,21,24}    # type(c1) = set; a set is unordered
w4 = pd.Series(c1)
print(type(w4))
w4

TypeError: 'set' type is unordered
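
One possible workaround, shown here as a sketch, is to convert the set to an ordered container first. Sorting is just one choice of order; `list(c1)` would also work but gives no guaranteed order.

```python
import pandas as pd

c1 = {12, 15, 18, 21, 24}

# Give the values an explicit order before building the Series.
w4 = pd.Series(sorted(c1))

print(w4.tolist())   # [12, 15, 18, 21, 24]
```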

# DataFrame

### Methods to create a dataframe

1. Using Dictionary<br>
2. Using List of lists<br>
3. Using Series<br>
4. Using numpy array<br>
5. Reading csv, excel, json, parquet files etc.

### 1) Creating a dataframe using a dictionary

**Keys of the dict will be used as the column names for the dataframe**

In [48]:
data = {'Name':['Ankit','Shikha','Pankaj','Vibhor','Ujjwal','Kartik'],
        'Scores':[78,91,84,75,85,93]}
print(type(data))
print(data)

<class 'dict'>
{'Name': ['Ankit', 'Shikha', 'Pankaj', 'Vibhor', 'Ujjwal', 'Kartik'], 'Scores': [78, 91, 84, 75, 85, 93]}


In [52]:
df1 = pd.DataFrame(data)
df1  # displaying df1

     Name  Scores
0   Ankit      78
1  Shikha      91
2  Pankaj      84
3  Vibhor      75
4  Ujjwal      85
5  Kartik      93


In [54]:
print(df1)

     Name  Scores
0   Ankit      78
1  Shikha      91
2  Pankaj      84
3  Vibhor      75
4  Ujjwal      85
5  Kartik      93
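
Once the dataframe exists, a few attributes confirm that the dictionary keys became the columns. The cell below rebuilds the same data so that it runs on its own. Note that every list in the dictionary must have the same length, otherwise pandas raises a `ValueError`.

```python
import pandas as pd

data = {'Name': ['Ankit', 'Shikha', 'Pankaj', 'Vibhor', 'Ujjwal', 'Kartik'],
        'Scores': [78, 91, 84, 75, 85, 93]}
df1 = pd.DataFrame(data)

print(df1.shape)              # (6, 2): 6 rows, 2 columns
print(df1.columns.tolist())   # ['Name', 'Scores']
print(df1.dtypes)             # one dtype per column
```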


### 2) Creating a dataframe using a list of lists

1. Each inner list populates one row in the dataframe.
2. By default, columns are labeled with integers starting from 0. Provide column names explicitly to get meaningful labels.

In [56]:
data2 = [['Delhi','India',1],['Tokko','Japan',2],['Spain','Madrid',3]]
print(type(data2))
print(data2)

<class 'list'>
[['Delhi', 'India', 1], ['Tokko', 'Japan', 2], ['Spain', 'Madrid', 3]]


In [58]:
df2 = pd.DataFrame(data2)
df2.columns = ['Capital','Country','Geo Code']   # without this line the columns are labeled 0, 1, 2
df2

   Capital  Country  Geo Code
0    Delhi    India         1
1    Tokko    Japan         2
2    Spain   Madrid         3
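
Column names can also be passed directly to the constructor through the `columns` parameter instead of being assigned afterwards. A small sketch with illustrative rows (the `rows` list is made up for this example):

```python
import pandas as pd

rows = [['New Delhi', 'India', 1], ['Tokyo', 'Japan', 2], ['Madrid', 'Spain', 3]]

# Without names, the columns get integer labels.
df_default = pd.DataFrame(rows)
print(df_default.columns.tolist())   # [0, 1, 2]

# With names supplied up front.
df_named = pd.DataFrame(rows, columns=['Capital', 'Country', 'Geo Code'])
print(df_named.columns.tolist())     # ['Capital', 'Country', 'Geo Code']
```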


### 3) Creating a dataframe using series

In [60]:
ser1 = pd.Series(['Python','DS','ML','Tableau','Excel'])
ser2 = pd.Series([89,78,82,85,93])
df3 = pd.DataFrame({'Domains':ser1,'Marks':ser2})
df3

   Domains  Marks
0   Python     89
1       DS     78
2       ML     82
3  Tableau     85
4    Excel     93
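
When a dataframe is built from several series, pandas lines the values up by index label, not by position. The sketch below uses two hypothetical series (`ser_a`, `ser_b`) whose indexes only partly overlap; labels missing from one series show up as `NaN`.

```python
import pandas as pd

ser_a = pd.Series(['Python', 'DS', 'ML'], index=[0, 1, 2])
ser_b = pd.Series([89, 78], index=[1, 2])   # no value for label 0

df = pd.DataFrame({'Domains': ser_a, 'Marks': ser_b})
print(df)

# Label 0 has no mark, so it is NaN and the column becomes float.
print(df['Marks'].isna().sum())   # 1
```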


### 4) Creating a dataframe using numpy arrays

In [63]:
sales = np.array([125,200,75,160])
units = np.array([5,10,3,8])
prod = np.array(['Coffee','Tea','Juice','Milk'])
df4 = pd.DataFrame({'Products':prod, 'Sales':sales, 'Units': units})
df4

   Products  Sales  Units
0    Coffee    125      5
1       Tea    200     10
2     Juice     75      3
3      Milk    160      8


In [65]:
print(type(df4['Sales']))
df4['Sales']

<class 'pandas.core.series.Series'>


0    125
1    200
2     75
3    160
Name: Sales, dtype: int32

In [67]:
print(type(df4[['Sales','Units']]))
df4[['Sales','Units']]

<class 'pandas.core.frame.DataFrame'>


   Sales  Units
0    125      5
1    200     10
2     75      3
3    160      8


### 5) Creating a dataframe from another dataframe

In [69]:
df5 = df4[['Products','Sales']]
df5

   Products  Sales
0    Coffee    125
1       Tea    200
2     Juice     75
3      Milk    160
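
If the new dataframe is going to be modified, it is usually safer to make the copy explicit with `.copy()`, so the changes clearly do not affect the original. A minimal sketch that rebuilds `df4` from plain lists so the cell runs on its own:

```python
import pandas as pd

df4 = pd.DataFrame({'Products': ['Coffee', 'Tea', 'Juice', 'Milk'],
                    'Sales': [125, 200, 75, 160],
                    'Units': [5, 10, 3, 8]})

# Explicit copy: df5 owns its data.
df5 = df4[['Products', 'Sales']].copy()
df5.loc[0, 'Sales'] = 999

print(df5.loc[0, 'Sales'])   # 999
print(df4.loc[0, 'Sales'])   # 125, the original is unchanged
```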


## Importing data using pandas

1)**From a CSV file**

CSV stands for comma-separated values.

In [82]:
df_csv = pd.read_csv('test1.csv')
df_csv.head(3)

    Loan_ID  Gender  Married  Dependents  Education  Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  Property_Area  Loan_Status
0  LP001002    Male       No           0   Graduate             No             5849                0.0         NaN             360.0             1.0          Urban            Y
1  LP001003    Male      Yes           1   Graduate             No             4583             1508.0       128.0             360.0             1.0          Rural            N
2  LP001005    Male      Yes           0   Graduate            Yes             3000                0.0        66.0             360.0             1.0          Urban            Y
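
`read_csv` also accepts options that limit what gets loaded, such as `usecols` (which columns to keep) and `nrows` (how many rows to read). The sketch below reads from an in-memory string instead of `test1.csv` so that it runs anywhere; the text contains only a few columns and values copied from the output above, which is an assumption about how the file looks.

```python
import io
import pandas as pd

# A small stand-in for the CSV file; an empty field becomes NaN.
csv_text = """Loan_ID,Gender,ApplicantIncome,LoanAmount
LP001002,Male,5849,
LP001003,Male,4583,128.0
LP001005,Male,3000,66.0
"""

df = pd.read_csv(io.StringIO(csv_text),
                 usecols=['Loan_ID', 'ApplicantIncome', 'LoanAmount'],
                 nrows=2)

print(df)
print(df['LoanAmount'].isna().sum())   # 1
```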


2)**From an Excel file**

In [100]:
df_excel = pd.read_excel('Sample_Data.xlsx')
df_excel.head()

    OrderDate   Region      Rep    Item  Units  Unit Cost   Total
0  2016-01-06     East    Jones  Pencil     95       1.99  189.05
1  2016-01-23  Central   Kivell  Binder     50      19.99   999.5
2  2016-02-09  Central  Jardine  Pencil     36       4.99  179.64
3  2016-02-26  Central     Gill     Pen     27      19.99  539.73
4  2016-03-15     West  Sorvino  Pencil     56       2.99  167.44


3)**From a Parquet file**

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It also provides efficient data compression.

In [104]:
df_parquet = pd.read_parquet('Cars_Data.parquet')
df_parquet.head()

               model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110   3.9   2.62  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110   3.9  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85   2.32  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15   3.44  17.02   0   0     3     2


## Saving a DataFrame

**1) Saving a DataFrame as CSV**

In [106]:
df1

     Name  Scores
0   Ankit      78
1  Shikha      91
2  Pankaj      84
3  Vibhor      75
4  Ujjwal      85
5  Kartik      93


In [108]:
df1.to_csv('df1_24July.csv')
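
By default `to_csv` also writes the index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. Passing `index=False` avoids that. The sketch below writes to in-memory buffers instead of files so that it runs without touching the disk; the two-row dataframe is a shortened stand-in for `df1`.

```python
import io
import pandas as pd

df1 = pd.DataFrame({'Name': ['Ankit', 'Shikha'], 'Scores': [78, 91]})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
df1.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)        # ['Unnamed: 0', 'Name', 'Scores']

# index=False writes only the data columns.
without_index = io.StringIO()
df1.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index)
print(restored.columns.tolist())   # ['Name', 'Scores']
```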

**2) Saving a DataFrame as an Excel file**

In [110]:
df2.head()

   Capital  Country  Geo Code
0    Delhi    India         1
1    Tokko    Japan         2
2    Spain   Madrid         3


In [114]:
df2.to_excel('df2_24July.xlsx')

**3) Saving a DataFrame as a Parquet file**

In [116]:
df3

   Domains  Marks
0   Python     89
1       DS     78
2       ML     82
3  Tableau     85
4    Excel     93


In [118]:
df3.to_parquet('df3_24July.parquet')