# Basics

---

This is a brief introduction to **Pandas**, a data analysis library for Python; the second lecture goes into more depth. In this quick intro, we cover one of the most important aspects of data analysis, **how to read and write different data**, along with checking data shape and size.


### Lecture outline

---

* Read Data
* Write Data
* Data Size and Shape
* Summary Statistics
* Unique Observations
* Value Counts

#### Reference


[IO tools (text, CSV, HDF5, …)](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-tools-text-csv-hdf5)

[Pandas: How to Read and Write Files](https://realpython.com/pandas-read-write-files/)

In [1]:
import pandas as pd  # "pd" is the conventional alias

## Read Data

---

Reading a data file is the first operation you will perform in any data analysis, so you need to be able to read different file formats. Let's start with the most common one, the CSV file.

In [4]:
csv_file = pd.read_csv("data/admission.csv")

csv_file.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [5]:
excel_file = pd.read_excel("data/titanic.xlsx")

excel_file.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [7]:
stata_file = pd.read_stata("data/airline.dta")

stata_file.head()

Unnamed: 0,year,y,w,r,l,k
0,1948,1.214,0.243,0.1454,1.415,0.612
1,1949,1.354,0.26,0.2181,1.384,0.559
2,1950,1.569,0.278,0.3157,1.388,0.573
3,1951,1.948,0.297,0.394,1.55,0.564
4,1952,2.265,0.31,0.3559,1.802,0.574


In [8]:
sas_file = pd.read_sas("data/alcohol.sas7bdat")

sas_file.head()

Unnamed: 0,ADULTS,KIDS,INCOME,CONSUME
0,2.0,2.0,758.0,1.0
1,2.0,3.0,1785.0,1.0
2,3.0,0.0,1200.0,1.0
3,1.0,0.0,545.0,1.0
4,4.0,1.0,547.0,1.0


In [9]:
spss_file = pd.read_spss("data/sleep.sav")

spss_file.head()

Unnamed: 0,id,sex,age,marital,edlevel,weight,height,healthrate,fitrate,weightrate,...,sleepy,energy,stayslprec,getsleprec,qualsleeprec,totsas,cigsgp3,agegp3,probsleeprec,drvslprec
0,83.0,female,42.0,married/defacto,secondary school,52.0,162.0,very good,7.0,5.0,...,2.0,2.0,,,"very good, excellent",10.0,6 - 15,38 - 50,no,no
1,294.0,female,54.0,married/defacto,postgraduate degree,65.0,174.0,8.0,7.0,5.0,...,5.0,5.0,yes,yes,good,20.0,,51+,no,no
2,425.0,male,,married/defacto,secondary school,89.0,170.0,6.0,5.0,7.0,...,6.0,5.0,yes,yes,"very poor, poor",31.0,,,no,yes
3,64.0,female,41.0,married/defacto,postgraduate degree,66.0,178.0,9.0,7.0,5.0,...,6.0,8.0,no,yes,good,34.0,<= 5,38 - 50,no,no
4,536.0,female,39.0,married/defacto,postgraduate degree,62.0,160.0,9.0,5.0,7.0,...,6.0,6.0,yes,yes,good,25.0,,38 - 50,no,no


In [10]:
json_file = pd.read_json("data/example.json")

json_file.head()

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9
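All of the readers above follow the same pattern: pass a path (or a file-like object) and get a `DataFrame` back. The sketch below illustrates this without needing the lecture's `data/` folder by reading from in-memory text; the column names and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for files on disk.
csv_text = "name,score\nAda,90\nGrace,85\n"
json_text = '[{"name": "Ada", "score": 90}, {"name": "Grace", "score": 85}]'

# read_csv and read_json accept file-like objects as well as file paths.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

Both calls should produce a two-row table with the columns `name` and `score`, which shows that the source format matters only at read time.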


## Write Data

---

Writing data to a file follows almost the same procedure as reading. When writing, we pass the destination path, which includes the name of the new file. Let's write a CSV file and an Excel file; the other formats work in much the same way.

In [11]:
csv_file.to_csv("data/new_csv_file.csv")

In [None]:
excel_file.to_excel("data/new_excel_file.xlsx")
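One detail worth knowing when writing: by default `to_csv` also writes the row index as an extra first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the round trip with and without `index=False`, using a temporary directory and a small made-up DataFrame rather than the lecture's files.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8]})  # made-up data

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = os.path.join(tmp_dir, "with_index.csv")
    without_index = os.path.join(tmp_dir, "without_index.csv")

    df.to_csv(with_index)                  # default: index written as first column
    df.to_csv(without_index, index=False)  # index left out of the file

    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'a', 'b']
print(cols_without_index)  # ['a', 'b']
```

Whether to keep the index depends on the data: drop it when it is just a row counter, keep it when it carries meaning (for example, dates or IDs).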

## Data Size and Shape

---

We will talk about data size and shape in more detail in the second lecture. Here we quickly cover what they are and how to use that information.

In [12]:
csv_file.size # Returns number of elements in DataFrame

3600

In [13]:
csv_file.shape # Returns the number of rows and columns, respectively

(400, 9)
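The two attributes are related: `size` is the number of rows times the number of columns, which is why 400 × 9 gives 3600 above. A minimal sketch on a small made-up DataFrame:

```python
import pandas as pd

# Made-up data: 4 rows, 3 columns.
df = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [10, 20, 30, 40], "z": [0.1, 0.2, 0.3, 0.4]}
)

n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple

print(df.shape)  # (4, 3)
print(df.size)   # 12
print(len(df))   # 4, the number of rows only
```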

## Summary Statistics

---

The `.describe()` method returns summary statistics for the numeric columns of your data, giving you a quick overview of the data at hand.

In [14]:
csv_file.describe()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
count,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0
mean,200.5,316.8075,107.41,3.0875,3.4,3.4525,8.598925,0.5475,0.72435
std,115.614301,11.473646,6.069514,1.143728,1.006869,0.898478,0.596317,0.498362,0.142609
min,1.0,290.0,92.0,1.0,1.0,1.0,6.8,0.0,0.34
25%,100.75,308.0,103.0,2.0,2.5,3.0,8.17,0.0,0.64
50%,200.5,317.0,107.0,3.0,3.5,3.5,8.61,1.0,0.73
75%,300.25,325.0,112.0,4.0,4.0,4.0,9.0625,1.0,0.83
max,400.0,340.0,120.0,5.0,5.0,5.0,9.92,1.0,0.97
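To see what each row of the summary means, it helps to run `.describe()` on data small enough to check by hand. The sketch below uses a made-up DataFrame with one text column and one numeric column; by default only the numeric column appears in the summary.

```python
import pandas as pd

# Made-up data: one text column, one numeric column.
df = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d", "e"],
        "score": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

summary = df.describe()  # numeric columns only, by default

print(list(summary.columns))          # ['score']
print(summary.loc["mean", "score"])   # 3.0
print(summary.loc["50%", "score"])    # 3.0 (the median)
print(summary.loc["std", "score"])    # sample standard deviation (ddof=1)
```

The `std` row matches `df["score"].std()`, which divides by `n - 1` rather than `n`.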


## Unique Observations

---

Summary statistics do not tell us how many unique observations each column contains. We can list a column's distinct values with the `.unique()` method.

In [15]:
csv_file.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [20]:
csv_file["University Rating"].unique()

array([4, 3, 2, 5, 1])

In [16]:
csv_file["Research"].unique() # Only two unique values in "Research" column

array([1, 0])
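When only the number of distinct values matters, `.nunique()` returns that count directly. A small sketch on a made-up Series, shown next to `.unique()` for comparison:

```python
import pandas as pd

ratings = pd.Series([4, 4, 3, 3, 2, 5, 1, 3], name="rating")  # made-up data

distinct = ratings.unique()   # NumPy array, in order of first appearance
n_distinct = ratings.nunique()

print(distinct.tolist())  # [4, 3, 2, 5, 1]
print(n_distinct)         # 5
```

Note that `.unique()` does not sort its result; the values appear in the order they are first seen in the column.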

## Value Counts

---

We can count how many times each distinct value occurs in a column with the `.value_counts()` method.

In [21]:
csv_file.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [23]:
csv_file["University Rating"].value_counts()

3    133
2    107
4     74
5     60
1     26
Name: University Rating, dtype: int64

In [22]:
csv_file["Research"].value_counts() # 219 ones and 181 zeros, 400 in total

1    219
0    181
Name: Research, dtype: int64
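Counts are often easier to compare as proportions. Passing `normalize=True` to `.value_counts()` returns the share of each value instead of the raw count. A minimal sketch on a made-up Series:

```python
import pandas as pd

research = pd.Series([1, 0, 1, 1, 0], name="research")  # made-up data

counts = research.value_counts()                # sorted by frequency, descending
shares = research.value_counts(normalize=True)  # fractions that sum to 1

print(int(counts.loc[1]), int(counts.loc[0]))  # 3 2
print(float(shares.loc[1]))                    # 0.6
```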

# Summary

---

In this lecture, we learned how to set up our working environment and install the libraries needed for data analysis. We also covered one of the most important aspects of data analysis: reading and writing data. In the next lecture, we will explore more of the capabilities of `Pandas`.