In [1]:
for i in range(20):
    print(i*3)
    

0
3
6
9
12
15
18
21
24
27
30
33
36
39
42
45
48
51
54
57


## We will see below how to load data from a CSV file and do some simple data manipulations using *pandas*.

### Load some necessary libraries to start

In [2]:
import pandas as pd

### Load the whole data set, which is given here as a CSV (comma-separated values) file.
### Data can be given in other ways too (JSON, a URL, a database connection, an Excel file, etc.)

In [3]:
irisData = pd.read_csv("iris_dataset_with_class_information.csv")

In [4]:
irisData.shape  # Number of rows and columns - note that header row is not counted

(150, 5)
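
To see why the header row is not counted, here is a minimal sketch that reads a tiny made-up CSV from memory instead of from disk. The CSV text and the names *csv_text* and *small* are illustrative assumptions, not part of the iris data set.

```python
import io

import pandas as pd

# Hypothetical CSV: one header line followed by three data rows.
csv_text = "name,score\nann,1.5\nbob,2.0\ncat,3.5\n"

# read_csv accepts a file-like object as well as a file name.
small = pd.read_csv(io.StringIO(csv_text))

print(small.shape)          # (3, 2): three data rows, two columns
print(list(small.columns))  # the header line supplies the column names
```

The header line becomes the column names, so only the three data rows count toward the shape.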

In [5]:
print(type(irisData))   # the loaded data is a DataFrame object in Pandas

<class 'pandas.core.frame.DataFrame'>


In [6]:
irisData

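|     | sepal.length.in.cm | sepal.width.in.cm | petal.length.in.cm | petal.width.in.cm | species   |
|-----|--------------------|-------------------|--------------------|-------------------|-----------|
| 0   | 5.1                | 3.5               | 1.4                | 0.2               | setosa    |
| 1   | 4.9                | 3.0               | 1.4                | 0.2               | setosa    |
| 2   | 4.7                | 3.2               | 1.3                | 0.2               | setosa    |
| 3   | 4.6                | 3.1               | 1.5                | 0.2               | setosa    |
| 4   | 5.0                | 3.6               | 1.4                | 0.2               | setosa    |
| ... | ...                | ...               | ...                | ...               | ...       |
| 145 | 6.7                | 3.0               | 5.2                | 2.3               | virginica |
| 146 | 6.3                | 2.5               | 5.0                | 1.9               | virginica |
| 147 | 6.5                | 3.0               | 5.2                | 2.0               | virginica |
| 148 | 6.2                | 3.4               | 5.4                | 2.3               | virginica |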


### *Note that row numbers start from 0. Pandas also added an index column at the far left.*
By default, pandas displays at most 20 columns for wide dataframes and at most 60 rows, truncating the middle section. If you'd like to change these limits, you can edit the defaults using the pandas option functions (*pd.get_option()* reads a value, *pd.set_option()* changes it):
* pd.get_option("display.max_rows") ## ---> displays the current value of the option
* pd.set_option("display.max_rows", 200) ## ---> sets the value of the option
* pd.reset_option("display.max_rows") ## ---> resets the option to its default

The full set of options is listed in the official pandas options and settings [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html).
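
The three option calls can be tried together as in the sketch below. The variable names are illustrative. The last part uses *pd.option_context()*, which is not mentioned above but is a standard pandas way to change an option only inside a *with* block.

```python
import pandas as pd

pd.set_option("display.max_rows", 200)           # raise the row limit
raised = pd.get_option("display.max_rows")       # read it back

pd.reset_option("display.max_rows")              # restore the pandas default
after_reset = pd.get_option("display.max_rows")

# Temporary change: the option reverts when the with-block ends.
with pd.option_context("display.max_rows", 5):
    inside = pd.get_option("display.max_rows")
outside = pd.get_option("display.max_rows")

print(raised, after_reset, inside, outside)
```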

## DataFrame objects
When pandas loads CSV data from a file, the data is stored in a spreadsheet-like data structure called a DataFrame. A dataframe has rows and columns, and the columns can be of different types: one column of numeric values, another of categorical data, a boolean column, etc.
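
As a small illustration of columns with different types, the sketch below builds a dataframe by hand and prints the type of each column. The dataframe *flowers* and its values are made up for this example.

```python
import pandas as pd

# Hypothetical data: one numeric, one categorical and one boolean column.
flowers = pd.DataFrame({
    "length": [5.1, 6.3, 5.9],
    "species": pd.Categorical(["setosa", "virginica", "versicolor"]),
    "is_long": [False, True, True],
})

# dtypes lists the type pandas assigned to each column.
print(flowers.dtypes)
```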

In [7]:
X1 = irisData["sepal.length.in.cm"]  # pick just the one column titled "sepal.length.in.cm"; see "Subsetting a dataframe" below

In [8]:
X1

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal.length.in.cm, Length: 150, dtype: float64

In [9]:
X2 = irisData[["sepal.length.in.cm", "sepal.width.in.cm"]]  # pick two columns; note the list inside the subscript

In [10]:
X2

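|     | sepal.length.in.cm | sepal.width.in.cm |
|-----|--------------------|-------------------|
| 0   | 5.1                | 3.5               |
| 1   | 4.9                | 3.0               |
| 2   | 4.7                | 3.2               |
| 3   | 4.6                | 3.1               |
| 4   | 5.0                | 3.6               |
| ... | ...                | ...               |
| 145 | 6.7                | 3.0               |
| 146 | 6.3                | 2.5               |
| 147 | 6.5                | 3.0               |
| 148 | 6.2                | 3.4               |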
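
The two selections above return different kinds of objects: a single column name gives a Series, while a list of column names gives a DataFrame. A minimal sketch on a made-up dataframe (the name *toy* and its values are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "length": [5.1, 4.9],
    "width": [3.5, 3.0],
    "species": ["setosa", "setosa"],
})

one_col = toy["length"]             # single name -> Series
two_cols = toy[["length", "width"]] # list of names -> DataFrame
one_col_frame = toy[["length"]]     # one-element list -> one-column DataFrame

print(type(one_col).__name__)        # Series
print(type(two_cols).__name__)       # DataFrame
print(one_col_frame.shape)           # (2, 1)
```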


### Subsetting a dataframe

Quite often you will need to pick out some rows and columns of a dataframe. This is called **subsetting**.
Subsetting is done by passing column names, or row/column numbers as a range.
When specifying a range:
* m:n means starting at m, going up to but *not including* n.
* m: means from m to the end; :m means from 0 up to but *not including* m.
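
The range rules can be checked on a small made-up dataframe, as in the sketch below. The dataframe *grid* is an assumption for illustration; **iloc** is the positional indexer used in the cells that follow.

```python
import pandas as pd

# Hypothetical 6-row, 3-column dataframe.
grid = pd.DataFrame({
    "a": range(0, 6),
    "b": range(10, 16),
    "c": range(20, 26),
})

first_block = grid.iloc[0:2, 0:2]  # rows 0-1, columns a-b (2 is not included)
tail_block = grid.iloc[4:, 1:]     # rows 4 to the end, columns b to the end
head_block = grid.iloc[:3, :1]     # rows 0-2, column a only

print(first_block.shape)           # (2, 2)
print(list(tail_block.index))      # [4, 5]
print(list(head_block.columns))    # ['a']
```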

In [11]:
irisData.iloc[0:10,  0:5]  # first range is that of rows, second is range of columns

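|   | sepal.length.in.cm | sepal.width.in.cm | petal.length.in.cm | petal.width.in.cm | species |
|---|--------------------|-------------------|--------------------|-------------------|---------|
| 0 | 5.1                | 3.5               | 1.4                | 0.2               | setosa  |
| 1 | 4.9                | 3.0               | 1.4                | 0.2               | setosa  |
| 2 | 4.7                | 3.2               | 1.3                | 0.2               | setosa  |
| 3 | 4.6                | 3.1               | 1.5                | 0.2               | setosa  |
| 4 | 5.0                | 3.6               | 1.4                | 0.2               | setosa  |
| 5 | 5.4                | 3.9               | 1.7                | 0.4               | setosa  |
| 6 | 4.6                | 3.4               | 1.4                | 0.3               | setosa  |
| 7 | 5.0                | 3.4               | 1.5                | 0.2               | setosa  |
| 8 | 4.4                | 2.9               | 1.4                | 0.2               | setosa  |
| 9 | 4.9                | 3.1               | 1.5                | 0.1               | setosa  |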


In [12]:
irisData.iloc[:10,  :5]  # Same as above, but starting 0 omitted

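|   | sepal.length.in.cm | sepal.width.in.cm | petal.length.in.cm | petal.width.in.cm | species |
|---|--------------------|-------------------|--------------------|-------------------|---------|
| 0 | 5.1                | 3.5               | 1.4                | 0.2               | setosa  |
| 1 | 4.9                | 3.0               | 1.4                | 0.2               | setosa  |
| 2 | 4.7                | 3.2               | 1.3                | 0.2               | setosa  |
| 3 | 4.6                | 3.1               | 1.5                | 0.2               | setosa  |
| 4 | 5.0                | 3.6               | 1.4                | 0.2               | setosa  |
| 5 | 5.4                | 3.9               | 1.7                | 0.4               | setosa  |
| 6 | 4.6                | 3.4               | 1.4                | 0.3               | setosa  |
| 7 | 5.0                | 3.4               | 1.5                | 0.2               | setosa  |
| 8 | 4.4                | 2.9               | 1.4                | 0.2               | setosa  |
| 9 | 4.9                | 3.1               | 1.5                | 0.1               | setosa  |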


In [13]:
irisData.iloc[145:,  2:] #Rows starting from 145 to the end, columns starting from 2 to the end. Remember row and column numbers begin at 0

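|     | petal.length.in.cm | petal.width.in.cm | species   |
|-----|--------------------|-------------------|-----------|
| 145 | 5.2                | 2.3               | virginica |
| 146 | 5.0                | 1.9               | virginica |
| 147 | 5.2                | 2.0               | virginica |
| 148 | 5.4                | 2.3               | virginica |
| 149 | 5.1                | 1.8               | virginica |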


### Shuffling the data
Since the data is grouped by species, you need to shuffle the rows first if you want to take a representative random sample. This is usually good practice before you subset any data set for analysis.

The dataframe method **sample()** draws a random sample of the data rows. You can take a partial or a full sample. With the **frac=1** argument, sample() returns all the rows in random order, without replacement, which amounts to shuffling the rows.

The **random_state** option to sample() sets the seed of the random number generator, for reproducible results. If it is not set, every run will result in a different sample.

Shuffling the rows also shuffles the index column on the far left. You can renumber the rows with **reset_index()**, as in the command below; **drop=True** discards the old index instead of keeping it as an extra column.


In [14]:
X = irisData.sample(frac=1, random_state=12).reset_index(drop=True) # shuffle, index column renumbered
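
To see what **random_state** and **reset_index(drop=True)** do, here is a minimal sketch on a small made-up dataframe. The name *toy* and its values are assumptions for illustration; the seed 12 is simply the one used above.

```python
import pandas as pd

toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# Same seed twice: both shuffles come out in the same order.
shuffled_a = toy.sample(frac=1, random_state=12)
shuffled_b = toy.sample(frac=1, random_state=12)
same_order = shuffled_a.equals(shuffled_b)

# The shuffled frame keeps the old row labels; reset_index renumbers them.
renumbered = shuffled_a.reset_index(drop=True)

print(same_order)               # True
print(list(shuffled_a.index))   # old labels, in shuffled order
print(list(renumbered.index))   # [0, 1, 2, 3, 4]
```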

In [15]:
X  # Same data as before (as originally loaded), but the rows have been shuffled randomly

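|     | sepal.length.in.cm | sepal.width.in.cm | petal.length.in.cm | petal.width.in.cm | species    |
|-----|--------------------|-------------------|--------------------|-------------------|------------|
| 0   | 5.0                | 3.5               | 1.3                | 0.3               | setosa     |
| 1   | 6.3                | 2.5               | 5.0                | 1.9               | virginica  |
| 2   | 4.4                | 3.0               | 1.3                | 0.2               | setosa     |
| 3   | 5.7                | 2.8               | 4.1                | 1.3               | versicolor |
| 4   | 6.8                | 3.2               | 5.9                | 2.3               | virginica  |
| ... | ...                | ...               | ...                | ...               | ...        |
| 145 | 6.8                | 2.8               | 4.8                | 1.4               | versicolor |
| 146 | 4.6                | 3.1               | 1.5                | 0.2               | setosa     |
| 147 | 7.4                | 2.8               | 6.1                | 1.9               | virginica  |
| 148 | 6.1                | 2.6               | 5.6                | 1.4               | virginica  |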


### Smaller Samples
You can take smaller samples from the whole data by changing the value of the frac argument to the fraction you desire. 
To take a sample of 1/2 the size (75):

In [16]:
X2 = irisData.sample(frac=.5, random_state=12).reset_index(drop=True)

In [17]:
X2.shape

(75, 5)
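
Besides **frac**, sample() also accepts an absolute row count through its **n** argument. The sketch below compares the two on a small made-up dataframe; the name *toy* and its values are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(10)})

half_by_frac = toy.sample(frac=0.5, random_state=12)  # half of the 10 rows
half_by_n = toy.sample(n=5, random_state=12)          # exactly 5 rows

print(len(half_by_frac), len(half_by_n))

# Sampling is without replacement by default, so no row appears twice.
print(half_by_n.index.is_unique)
```

Use **frac** when the sample size should scale with the data, and **n** when you need a fixed number of rows.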

In [18]:
X2

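|     | sepal.length.in.cm | sepal.width.in.cm | petal.length.in.cm | petal.width.in.cm | species    |
|-----|--------------------|-------------------|--------------------|-------------------|------------|
| 0   | 5.0                | 3.5               | 1.3                | 0.3               | setosa     |
| 1   | 6.3                | 2.5               | 5.0                | 1.9               | virginica  |
| 2   | 4.4                | 3.0               | 1.3                | 0.2               | setosa     |
| 3   | 5.7                | 2.8               | 4.1                | 1.3               | versicolor |
| 4   | 6.8                | 3.2               | 5.9                | 2.3               | virginica  |
| ... | ...                | ...               | ...                | ...               | ...        |
| 70  | 7.6                | 3.0               | 6.6                | 2.1               | virginica  |
| 71  | 5.1                | 3.5               | 1.4                | 0.2               | setosa     |
| 72  | 6.7                | 3.1               | 5.6                | 2.4               | virginica  |
| 73  | 7.9                | 3.8               | 6.4                | 2.0               | virginica  |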


### Renaming Columns
You can rename the columns of a dataframe in multiple ways. Here is one way:


In [24]:
irisData.rename(columns={"sepal.length.in.cm":"SepalLen","sepal.width.in.cm":"SepalWid","petal.length.in.cm":"PetalLen","petal.width.in.cm":"PetalWid"}, inplace=True)

In [25]:
irisData

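|     | SepalLen | SepalWid | PetalLen | PetalWid | species   |
|-----|----------|----------|----------|----------|-----------|
| 0   | 5.1      | 3.5      | 1.4      | 0.2      | setosa    |
| 1   | 4.9      | 3.0      | 1.4      | 0.2      | setosa    |
| 2   | 4.7      | 3.2      | 1.3      | 0.2      | setosa    |
| 3   | 4.6      | 3.1      | 1.5      | 0.2      | setosa    |
| 4   | 5.0      | 3.6      | 1.4      | 0.2      | setosa    |
| ... | ...      | ...      | ...      | ...      | ...       |
| 145 | 6.7      | 3.0      | 5.2      | 2.3      | virginica |
| 146 | 6.3      | 2.5      | 5.0      | 1.9      | virginica |
| 147 | 6.5      | 3.0      | 5.2      | 2.0      | virginica |
| 148 | 6.2      | 3.4      | 5.4      | 2.3      | virginica |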


### In the above case, see what happens if you leave out the parameter *inplace=True*
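
One way to explore this is the sketch below, which uses a small made-up dataframe (the name *toy* is an assumption for illustration). Without *inplace=True*, rename() returns a new dataframe and leaves the original unchanged; with *inplace=True*, it changes the original and returns None.

```python
import pandas as pd

toy = pd.DataFrame({"sepal.length.in.cm": [5.1, 4.9]})

# Without inplace=True: the result must be captured, toy is untouched.
renamed = toy.rename(columns={"sepal.length.in.cm": "SepalLen"})
cols_after_plain = list(toy.columns)

# With inplace=True: toy itself is modified and the call returns None.
result = toy.rename(columns={"sepal.length.in.cm": "SepalLen"}, inplace=True)
cols_after_inplace = list(toy.columns)

print(cols_after_plain)     # ['sepal.length.in.cm']
print(list(renamed.columns))  # ['SepalLen']
print(result)               # None
print(cols_after_inplace)   # ['SepalLen']
```

So if you leave out *inplace=True*, assign the result to a name (for example back to the same variable), or the renaming is lost.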