<h1 align=center><font size = 5>DATA FRAMES IN PYTHON</font></h1>


## Table of Contents


<div class="alert alert-block alert-info" style="margin-top: 20px">
<li><a href="#ref0">About the Dataset</a></li>
<li><a href="#ref3">Data Frames</a></li>
</div>

<hr>

<a id="ref0"></a>
<center><h2>About the Dataset</h2></center>

Imagine you got many album recommendations from your friends and compiled all of the recommendations in a table, with specific info about each album.

The table has one row for each album and the following columns:

- **artist** - Name of the artist
- **album** - Name of the album
- **released_year** - Year the album was released
- **length_min_sec** - Length of the album (hours,minutes,seconds)
- **genre** - Genre of the album
- **music_recording_sales_millions** - Music recording sales (millions in USD)
- **claimed_sales_millions** - Album's claimed sales (millions in USD)
- **date_released** - Date on which the album was released
- **soundtrack** - Indicates whether the album is a movie soundtrack (Y) or not (N)
- **rating_of_friends** - Your friends' rating of the album, from 1 to 10
<br>
<br>

The dataset can be seen below:

In [1]:
import pandas as pd
music_df = pd.read_csv("dataset/music_dataset.csv")
music_df.head()


Unnamed: 0,artist,album,released_year,length_min_sec,genre,music_recording_sales_millions,claimed_sales_millions,date_released,soundtrack,rating_of_friends
0,Michael Jackson,Thriller,1982,42:19:00,"Pop, rock, R&B",46.0,65,30/11/82,N,10.0
1,AC/DC,Back in Black,1980,42:11:00,Hard rock,26.1,50,25/07/80,N,8.5
2,Pink Floyd,The Dark Side of the Moon,1973,42:49:00,Prigressive rock,24.2,45,01/03/73,N,9.5
3,Whtney Houston,The Bodyguard,1992,57:44:00,"R&B, soul, pop",27.4,44,17/11/92,Y,7.5
4,Meat Loaf,Bat Out of Hell,1977,46:33:00,"Hard rock, progressive rock",20.6,43,21/10/77,N,7.0
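
The **date_released** values are read in as plain text. As a minimal sketch, assuming the day/month/two-digit-year layout shown above, the snippet below converts that column to real dates. It parses two rows copied from the table out of an in-memory string, so it runs without the CSV file; the names `sample_csv` and `sample` are hypothetical.

```python
import io
import pandas as pd

# Two rows copied from the table above; an in-memory stand-in for the CSV file.
sample_csv = io.StringIO(
    "artist,album,released_year,date_released\n"
    "Michael Jackson,Thriller,1982,30/11/82\n"
    "AC/DC,Back in Black,1980,25/07/80\n"
)
sample = pd.read_csv(sample_csv)

# Parse day/month/two-digit-year strings into datetime values.
sample["date_released"] = pd.to_datetime(sample["date_released"], format="%d/%m/%y")
print(sample.dtypes)
```

Once the column holds dates, parts such as the year or month can be read through the `.dt` accessor.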


<hr>

<a id="ref3"></a>
<center><h2>Data Frames</h2></center>

A data frame is a structure for storing tabular data. Underneath it all, a data frame is a collection of arrays of the same length, where each array is one column of the table. To create one, we call the pandas constructor **DataFrame** and pass it the column data. In the example below the data is a dictionary: each key becomes a column name and each value holds that column's entries. If no column names are given, pandas labels the columns with integers starting at 0.

First, we must import the pandas library:

In [2]:
import pandas as pd

We would like to create a data frame in the following form:

 <a ><img src = https://ibm.box.com/shared/static/486idtdd4oxa8laltxk7gbn8l2vp2hq3.png width = 700, align = "center"></a>
  <h4 align=center> Figure 1: An example of a DataFrame
  </h4>

In [3]:

songs = {'Album': ['Thriller', 'Back in Black', 'The Dark Side of the Moon',
                   'The Bodyguard', 'Bat Out of Hell'],
         'Released': [1982, 1980, 1973, 1992, 1977],
         'Length': ['00:42:19', '00:42:11', '00:42:49', '00:57:44', '00:46:33']}
songs_frame = pd.DataFrame(songs)
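
A dictionary of columns is only one possible input. As a rough sketch, the same kind of frame can be built from a list of rows, with the column names passed through the `columns` argument. The names `rows`, `frame_from_rows`, and `unnamed` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Each inner list is one row; the column names are supplied separately.
rows = [
    ['Thriller', 1982, '00:42:19'],
    ['Back in Black', 1980, '00:42:11'],
]
frame_from_rows = pd.DataFrame(rows, columns=['Album', 'Released', 'Length'])

# Without names, pandas falls back to integer column labels.
unnamed = pd.DataFrame(rows)

print(list(frame_from_rows.columns))
print(list(unnamed.columns))
```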

Let's print the first two rows of our newly created data frame:

In [4]:
songs_frame.head(2)

Unnamed: 0,Album,Released,Length
0,Thriller,1982,00:42:19
1,Back in Black,1980,00:42:11


Notice that it looks like a table.
Now let us access the column of the data frame named 'Album'.

In [5]:
songs_frame['Album']

0                     Thriller
1                Back in Black
2    The Dark Side of the Moon
3                The Bodyguard
4              Bat Out of Hell
Name: Album, dtype: object

The corresponding column is highlighted in Figure 2:

 <a ><img src = https://ibm.box.com/shared/static/03ifetpnpiza7vn0gyf2qtvsdbgwrvxf.png width = 700, align = "center"></a>
  <h4 align=center> Figure 2: An example of a DataFrame with the column 'Album' indicated in yellow.

  </h4>

We can retrieve the second column as well:

 
<a ><img src = https://ibm.box.com/shared/static/judo6tn30dd9w61bmit5vn600ln2hzwq.png width = 700, align = "center"></a>
  <h4 align=center> Figure 3: An example of a DataFrame with the column 'Released' indicated in yellow.
  </h4>

You can retrieve the column “Released” in multiple ways:

In [6]:
# Attribute access returns the 'Released' column (the second column)
songs_frame.Released

0    1982
1    1980
2    1973
3    1992
4    1977
Name: Released, dtype: int64

In [7]:
songs_frame['Released']

0    1982
1    1980
2    1973
3    1992
4    1977
Name: Released, dtype: int64
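
Both forms above return a single column as a pandas Series. Bracket access is the more general of the two: it also works for column names that contain spaces, and it accepts a list of names to select several columns at once. The sketch below illustrates this on a small stand-in frame; `demo`, `one_column`, and `two_columns` are hypothetical names.

```python
import pandas as pd

demo = pd.DataFrame({
    'Album': ['Thriller', 'Back in Black'],
    'Released': [1982, 1980],
    'Length': ['00:42:19', '00:42:11'],
})

# A single label returns a Series; a list of labels returns a DataFrame.
one_column = demo['Released']
two_columns = demo[['Album', 'Released']]

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
```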

We can use the **describe()** method to compute a set of summary statistics for the numeric columns of the DataFrame:

In [8]:
songs_frame.describe()

Unnamed: 0,Released
count,5.0
mean,1980.8
std,7.120393
min,1973.0
25%,1977.0
50%,1980.0
75%,1982.0
max,1992.0


The output covers only **Released**, the second column, because by default **describe()** summarizes numeric columns only. The count row shows that this column has 5 observations.
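
To summarize the text columns as well, one option is to pass `include='all'` to `describe()`. The sketch below tries this on a stand-in frame with the same albums; `demo` and `full_summary` are hypothetical names.

```python
import pandas as pd

demo = pd.DataFrame({
    'Album': ['Thriller', 'Back in Black', 'The Dark Side of the Moon',
              'The Bodyguard', 'Bat Out of Hell'],
    'Released': [1982, 1980, 1973, 1992, 1977],
})

# include='all' adds count/unique/top/freq rows for the text column;
# statistics that do not apply to a column are shown as NaN.
full_summary = demo.describe(include='all')
print(full_summary)
```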

We can determine the type of each column of a data frame with the **dtypes** attribute:

In [9]:
songs_frame.dtypes

Album       object
Released     int64
Length      object
dtype: object
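
The `object` type of **Length** means the durations are stored as text, so they cannot be summed or averaged directly. As a minimal sketch, assuming the strings follow the `HH:MM:SS` layout used above, `pd.to_timedelta` can convert them to durations. The names `demo` and `seconds` are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({'Length': ['00:42:19', '00:57:44']})

# Convert 'HH:MM:SS' strings to timedelta values so they can be compared and summed.
demo['Length'] = pd.to_timedelta(demo['Length'])

# Express each duration as a number of seconds.
seconds = demo['Length'].dt.total_seconds()

print(demo.dtypes)
print(seconds.tolist())
```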

You can select rows by using “slicing”. Let's say you would like to select the first three rows:


<a ><img src = https://ibm.box.com/shared/static/8b5lb73hemmyenw261ac2nms7xhitwrz.png width = 700, align = "center"></a>
  <h4 align=center> Figure 4: An example of a DataFrame with rows 0, 1, and 2 colored in yellow.

  </h4>

In [10]:
songs_frame[:3]

Unnamed: 0,Album,Released,Length
0,Thriller,1982,00:42:19
1,Back in Black,1980,00:42:11
2,The Dark Side of the Moon,1973,00:42:49


Let's say we would like to create a new data frame, "New_songs_frame", that consists of rows 1, 2, and 3:


 <a ><img src = https://ibm.box.com/shared/static/fvdiqz4tyfngvcz95wz16koif2b71848.png width = 700, align = "center"></a>
  <h4 align=center> Figure 5: An example of a DataFrame with rows 1, 2, and 3 colored in yellow.

  </h4>

The "New_songs_frame" can be created as follows:

In [11]:
New_songs_frame = songs_frame[1:4]
New_songs_frame

Unnamed: 0,Album,Released,Length
1,Back in Black,1980,00:42:11
2,The Dark Side of the Moon,1973,00:42:49
3,The Bodyguard,1992,00:57:44
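
Note that the sliced frame keeps the original row labels 1, 2, 3 rather than starting again at 0. The sketch below shows the same selection written with the explicit position-based indexer `iloc`, followed by `reset_index(drop=True)` to renumber the rows. The names `demo`, `subset`, and `renumbered` are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({
    'Album': ['Thriller', 'Back in Black', 'The Dark Side of the Moon',
              'The Bodyguard', 'Bat Out of Hell'],
    'Released': [1982, 1980, 1973, 1992, 1977],
})

# Position-based slice: rows at positions 1, 2, 3 (the end position 4 is excluded).
subset = demo.iloc[1:4]
print(list(subset.index))      # [1, 2, 3]

# Discard the old labels and number the rows from 0 again.
renumbered = subset.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1, 2]
```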


The **_head()_** method is very useful when you have a large table and need to take a peek at the first rows. By default it returns the first 5 rows of a data frame (or of a pandas Series); pass a number to ask for a different count. Now let us take a look at the first 3 rows of the data frame.

In [12]:
songs_frame.head(3)

Unnamed: 0,Album,Released,Length
0,Thriller,1982,00:42:19
1,Back in Black,1980,00:42:11
2,The Dark Side of the Moon,1973,00:42:49


Similarly, **_tail()_** returns the last 5 rows of a data frame or Series by default. Now let us take a look at the last 2 rows of the data frame.

In [13]:
songs_frame.tail(n=2)

Unnamed: 0,Album,Released,Length
3,The Bodyguard,1992,00:57:44
4,Bat Out of Hell,1977,00:46:33
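
Because `songs_frame` has only five rows, the default row count of `head()` and `tail()` is easier to see on a longer table. The sketch below uses a hypothetical ten-row frame named `numbers` for that purpose.

```python
import pandas as pd

# A ten-row frame holding the values 0..9.
numbers = pd.DataFrame({'value': range(10)})

print(len(numbers.head()))                # 5 rows by default
print(len(numbers.tail()))                # 5 rows by default
print(numbers.head(3)['value'].tolist())  # [0, 1, 2]
print(numbers.tail(2)['value'].tolist())  # [8, 9]
```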


Now let's try to add a new column 'Soundtrack' to our data frame.

In [14]:
songs_frame['Soundtrack'] = songs_frame.Album == 'The Bodyguard'
songs_frame

Unnamed: 0,Album,Released,Length,Soundtrack
0,Thriller,1982,00:42:19,False
1,Back in Black,1980,00:42:11,False
2,The Dark Side of the Moon,1973,00:42:49,False
3,The Bodyguard,1992,00:57:44,True
4,Bat Out of Hell,1977,00:46:33,False


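A new column was added to our data frame with just one line of code. The comparison `songs_frame.Album == 'The Bodyguard'` yields True or False for each row, and that result becomes the new column.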
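
Assigning to `songs_frame['Soundtrack']` changes the frame in place. As an alternative sketch, the `assign` method returns a new frame with the extra column and leaves the original untouched. The `Decade` column and the names `demo` and `with_decade` are hypothetical, chosen only for this illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    'Album': ['Thriller', 'The Bodyguard'],
    'Released': [1982, 1992],
})

# assign() returns a new frame; `demo` itself is not modified.
with_decade = demo.assign(Decade=(demo['Released'] // 10) * 10)

print(with_decade)
print(list(demo.columns))
```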

We don't need this column anymore, so let's delete it.

In [15]:
del songs_frame['Soundtrack']
songs_frame

Unnamed: 0,Album,Released,Length
0,Thriller,1982,00:42:19
1,Back in Black,1980,00:42:11
2,The Dark Side of the Moon,1973,00:42:49
3,The Bodyguard,1992,00:57:44
4,Bat Out of Hell,1977,00:46:33
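
The `del` statement removes the column from the frame in place. If the original frame should stay intact, one option is the `drop` method, which returns a new frame without the named columns. The sketch below uses hypothetical names `demo` and `trimmed`.

```python
import pandas as pd

demo = pd.DataFrame({
    'Album': ['Thriller', 'The Bodyguard'],
    'Soundtrack': [False, True],
})

# drop() returns a copy without the listed columns; `demo` keeps all of its columns.
trimmed = demo.drop(columns=['Soundtrack'])

print(list(trimmed.columns))
print(list(demo.columns))
```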


That's it! 

---