# PANDAS TUTORIAL part 1

2018-12-06 12:13:00 WIB

This tutorial shows how to create a data set, save it to a CSV file, read it back with pandas, and run a basic analysis on it.


## Create Data
We begin by creating our own data set for analysis. This prevents the end user reading this tutorial from having to download any files to replicate the results below. We will export this data set to a text file so that you can get some experience pulling data from a text file.

## Get Data 
We will learn how to read in the text file. The data consist of the names, positions, and ages of the Sex Pistols band members.

## Prepare Data 
Here we will take a look at the data and make sure it is clean. By clean I mean that we will inspect the contents of the text file and look for any anomalies. These can include missing data, inconsistencies in the data, or any other data that seems out of place. If any are found, we will have to decide what to do with those records.

## Analyze Data 
We will sort the members by age and compute summary statistics for the Age column: minimum, maximum, mean, median, and mode.

## Present Data 
Through tabular data and a graph, clearly show the end user which ages appear in the data set.

The Pandas library is used for all the data analysis excluding a small piece of the data presentation section. The Matplotlib library will only be needed for the data presentation section. Importing the libraries is the first step we will take in the lesson.

In [55]:
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

## Create Data

In [36]:
Names = ["John Lydon", "Glen Matlock", "Steve Jones", "Paul Cook", "Sid Vicious"]
Positions = ["Vocalist", "Bassist", "Guitarist", "Drummer", "Bassist"]
Age = [19, 19, 20, 19, 18]

In [37]:
SexPistols_DS = list(zip(Names, Positions, Age))
SexPistols_DS

[('John Lydon', 'Vocalist', 19),
 ('Glen Matlock', 'Bassist', 19),
 ('Steve Jones', 'Guitarist', 20),
 ('Paul Cook', 'Drummer', 19),
 ('Sid Vicious', 'Bassist', 18)]

Export Dataset to CSV file
1. Create a DataFrame. DataFrame is ...
2. Write the DataFrame that has been created to a CSV file using to_csv

In [38]:
df = pd.DataFrame(data = SexPistols_DS, columns = ("Names", "Positions", "Age"))
df

Unnamed: 0,Names,Positions,Age
0,John Lydon,Vocalist,19
1,Glen Matlock,Bassist,19
2,Steve Jones,Guitarist,20
3,Paul Cook,Drummer,19
4,Sid Vicious,Bassist,18


In [39]:
#df.to_csv("SexPistols.csv", index = False, header = False) 
# create the csv file in the same path as notebook file

df.to_csv("/home/pranatha/git_workspace/Tutorials/DataScience/DataWrangling/Frameworks/Inputs/SexPistols.csv", index = False, header = False)
# create the csv file in specific path
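The absolute path above exists only on the author's machine. Below is a minimal sketch of the same export written to a temporary directory so that it can run anywhere. The names `rows`, `out_dir`, and `csv_path` are assumptions made for illustration and do not appear in the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Same records as the tutorial's data set
rows = [
    ("John Lydon", "Vocalist", 19),
    ("Glen Matlock", "Bassist", 19),
    ("Steve Jones", "Guitarist", 20),
    ("Paul Cook", "Drummer", 19),
    ("Sid Vicious", "Bassist", 18),
]
df = pd.DataFrame(data=rows, columns=["Names", "Positions", "Age"])

# Hypothetical output location: a throwaway temporary directory
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "SexPistols.csv"

# index=False omits the row labels, header=False omits the column names
df.to_csv(csv_path, index=False, header=False)

# Inspect the raw file: the first line is already a data record
first_line = csv_path.read_text().splitlines()[0]
print(first_line)
```

Because `header=False` was used, the file starts directly with a data record, which matters when the file is read back in.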

## Get Data

In [40]:
Location = r"/home/pranatha/git_workspace/Tutorials/DataScience/DataWrangling/Frameworks/Inputs/SexPistols.csv"
df = pd.read_csv(Location)
df

Unnamed: 0,John Lydon,Vocalist,19
0,Glen Matlock,Bassist,19
1,Steve Jones,Guitarist,20
2,Paul Cook,Drummer,19
3,Sid Vicious,Bassist,18


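Fixing Header
The file was written without a header row, so read_csv treated the first record as the column names. There are two ways to fix this:
1. pass header = None so that pandas assigns default integer column labels
or
2. pass the names parameter to set specific column labels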

In [41]:
df = pd.read_csv(Location, header = None)
df

Unnamed: 0,0,1,2
0,John Lydon,Vocalist,19
1,Glen Matlock,Bassist,19
2,Steve Jones,Guitarist,20
3,Paul Cook,Drummer,19
4,Sid Vicious,Bassist,18


In [42]:
df = pd.read_csv(Location, names = ["Names", "Positions", "Age"])
df

Unnamed: 0,Names,Positions,Age
0,John Lydon,Vocalist,19
1,Glen Matlock,Bassist,19
2,Steve Jones,Guitarist,20
3,Paul Cook,Drummer,19
4,Sid Vicious,Bassist,18
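The three ways of reading a header-less file can be compared side by side. This is a small sketch that reads from an in-memory string instead of a file on disk; the two-row `csv_text` sample and the variable names are assumptions made for illustration.

```python
import io

import pandas as pd

# Two header-less records, in the same layout as the exported CSV file
csv_text = "John Lydon,Vocalist,19\nGlen Matlock,Bassist,19\n"

# Default behaviour: the first record is consumed as the header row
default_df = pd.read_csv(io.StringIO(csv_text))

# header=None: every record is data, columns are labelled 0, 1, 2
no_header_df = pd.read_csv(io.StringIO(csv_text), header=None)

# names=[...]: every record is data, columns get the given labels
named_df = pd.read_csv(io.StringIO(csv_text), names=["Names", "Positions", "Age"])

print(len(default_df), len(no_header_df), len(named_df))
print(list(default_df.columns))
print(list(no_header_df.columns))
print(list(named_df.columns))
```

With the default settings one record is lost to the header, while both `header=None` and `names` keep every record as data.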


## Prepare Data

Preparing the data consists of checking the data types and looking for missing values and/or outliers.

In [43]:
df.dtypes

Names        object
Positions    object
Age           int64
dtype: object
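The dtypes output covers the type check only. A minimal sketch of the missing-value check, assuming the same five records, could look like the following; the variable names `missing_per_column` and `age_is_integer` are illustrative and are not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Names": ["John Lydon", "Glen Matlock", "Steve Jones", "Paul Cook", "Sid Vicious"],
        "Positions": ["Vocalist", "Bassist", "Guitarist", "Drummer", "Bassist"],
        "Age": [19, 19, 20, 19, 18],
    }
)

# Count the missing values in each column
missing_per_column = df.isna().sum()

# Confirm that Age was parsed as an integer column rather than as text
age_is_integer = pd.api.types.is_integer_dtype(df["Age"])

print(missing_per_column)
print(age_is_integer)
```

A non-zero count in any column, or an Age column that is not numeric, would be a sign that some records need attention before the analysis.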

## Analyze Data

In [44]:
Sorted_Age = df.sort_values(["Age"])
Sorted_Age

Unnamed: 0,Names,Positions,Age
4,Sid Vicious,Bassist,18
0,John Lydon,Vocalist,19
1,Glen Matlock,Bassist,19
3,Paul Cook,Drummer,19
2,Steve Jones,Guitarist,20


In [45]:
df["Age"].min()

18

In [46]:
df["Age"].max()

20

In [47]:
df["Age"].mean()

19.0

In [48]:
df["Age"].median()

19.0

In [49]:
df["Age"].mode()

0    19
dtype: int64
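The separate calls above can also be collected into a single result. This is a sketch using `Series.agg` with a list of function names; the standalone `age` Series is an assumption that stands in for `df["Age"]`.

```python
import pandas as pd

# Stand-in for df["Age"] from the tutorial
age = pd.Series([19, 19, 20, 19, 18], name="Age")

# Compute several statistics at once; the result is a Series indexed by statistic name
summary = age.agg(["min", "max", "mean", "median"])

# mode() returns a Series because more than one value can tie for most frequent
modes = age.mode().tolist()

print(summary)
print(modes)
```

The mode is kept separate because it can return several values, whereas the other statistics each return one number.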

In [50]:
Sorted_Age

Unnamed: 0,Names,Positions,Age
4,Sid Vicious,Bassist,18
0,John Lydon,Vocalist,19
1,Glen Matlock,Bassist,19
3,Paul Cook,Drummer,19
2,Steve Jones,Guitarist,20


In [51]:
Min_Age = df["Age"].min()
Max_Age = df["Age"].max()
# "neither" excludes both endpoints (pandas 1.3+; older versions used inclusive = False)
Sorted_Age["Age"].between(Min_Age, Max_Age, inclusive = "neither")

4    False
0     True
1     True
3     True
2    False
Name: Age, dtype: bool
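A boolean Series like the one above is usually used as a mask to select rows. The sketch below builds the same exclusive range test with explicit comparisons, which avoids depending on the form of the `inclusive` argument; the names `mask` and `between_rows` are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Names": ["John Lydon", "Glen Matlock", "Steve Jones", "Paul Cook", "Sid Vicious"],
        "Positions": ["Vocalist", "Bassist", "Guitarist", "Drummer", "Bassist"],
        "Age": [19, 19, 20, 19, 18],
    }
)

min_age = df["Age"].min()
max_age = df["Age"].max()

# True for ages strictly between the minimum and the maximum
mask = (df["Age"] > min_age) & (df["Age"] < max_age)

# Indexing with the mask keeps only the rows where it is True
between_rows = df[mask]
print(between_rows)
```

Here the mask drops the youngest and the oldest member and keeps the three members aged 19.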

## Present Data

In [149]:
# Sort the rows by age and keep only the Age column
Age = df.sort_values(["Age"])
sortedAge = Age["Age"]

# Collect the distinct ages and sort them
AgeSeries = df["Age"]
uniqueSeries = pd.Series(AgeSeries.unique())
sortUniqueAge = uniqueSeries.sort_values()
sortUniqueAge

# Candidate plot, left disabled:
#df["Age"].value_counts().plot(kind = "barh")
#plt.show()

2    18
0    19
1    20
dtype: int64
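The cell above shows the table but not the graph. One possible way to draw it is a bar chart of how many members have each age. This is a sketch, not the author's plot: the axis labels, the title, and the names `age_counts`, `fig`, and `ax` are assumptions, and the `Agg` backend is selected only so that the code also runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required

import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for df["Age"] from the tutorial
age = pd.Series([19, 19, 20, 19, 18], name="Age")

# Count the members per age and order the result by age
age_counts = age.value_counts().sort_index()

# One bar per distinct age; ages are cast to strings so they are treated as categories
fig, ax = plt.subplots()
ax.bar(age_counts.index.astype(str), age_counts.values)
ax.set_xlabel("Age")
ax.set_ylabel("Number of members")
ax.set_title("Sex Pistols members by age")

print(age_counts.to_dict())
```

In a notebook with `%matplotlib inline`, the `matplotlib.use("Agg")` line can be dropped and the figure is rendered below the cell.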