# Exploring Data Arrays with NumPy

In [2]:
data = [50,50,47,97,49,3,53,42,26,74,82,62,37,15,70,27,36,35,48,52,63,64]
print(data)

[50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64]


##### A list, written with square brackets [ ], is an ordered collection of items of the same or different types, separated by commas. To access a list item, use its index number; indexes start at 0.
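
##### As a minimal sketch of list indexing, the following uses a small made-up list (items and its values are illustrative only, not part of the grades data):

```python
# A list can mix item types; positions are counted from 0.
items = [50, "fifty", 50.0]

first = items[0]      # the first item
last = items[-1]      # negative indexes count from the end
middle = items[1:2]   # a slice returns a new list

print(first, last, middle)
```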

##### The data has been loaded into a Python list, which is a good data type for general data manipulation but isn't optimized for numeric analysis. For that, we're going to use the NumPy package, which includes specific data types and functions for working with numbers in Python.

## Load the data into a NumPy array

In [3]:
import numpy as np
grades = np.array(data)
print(grades)

[50 50 47 97 49  3 53 42 26 74 82 62 37 15 70 27 36 35 48 52 63 64]


##### Just in case you're wondering about the differences between a list and a NumPy array, let's compare how these data types behave when we use them in an expression that multiplies them by two.

In [4]:
print(type(data), 'x 2:', data * 2)
print('---')
print(type(grades), 'x 2:', grades * 2)

<class 'list'> x 2: [50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64, 50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82, 62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64]
---
<class 'numpy.ndarray'> x 2: [100 100  94 194  98   6 106  84  52 148 164 124  74  30 140  54  72  70
  96 104 126 128]


##### type() returns the class that a variable's value belongs to.

##### The key takeaway from this is that NumPy arrays are specifically designed to support mathematical operations on numeric data, which makes them more useful for data analysis than a generic list.
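
##### To make the contrast concrete, here is a small sketch using only the first three grades. Doubling each value in a list needs an explicit loop (here a list comprehension), while the array applies the operation to every element at once. The names doubled_list and doubled_array are illustrative.

```python
import numpy as np

data = [50, 50, 47]          # first three grades only, to keep the output short
grades = np.array(data)

doubled_list = [value * 2 for value in data]   # a list needs an explicit loop
doubled_array = grades * 2                     # the array multiplies every element

print(doubled_list)
print(doubled_array.tolist())
```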

##### You might have spotted that the class type for the preceding NumPy array is numpy.ndarray. The nd indicates that this is a structure that can consist of multiple dimensions. (It can have n dimensions.) Our specific instance has a single dimension of student grades.

## View the shape of the array

In [5]:
grades.shape

(22,)

In [6]:
grades[0]

50
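
##### The number of dimensions can also be read directly from the ndim attribute. The sketch below compares a one-dimensional array with a two-dimensional one; one_d and two_d are made-up names and values for illustration.

```python
import numpy as np

one_d = np.array([50, 50, 47])
two_d = np.array([[10.0, 11.5, 9.0],
                  [50.0, 50.0, 47.0]])

# ndim is the number of dimensions; shape gives the size of each dimension
print(one_d.ndim, one_d.shape)
print(two_d.ndim, two_d.shape)
```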

##### Now that you know your way around a NumPy array, it's time to perform some analysis of the grades data.

##### You can apply aggregations across the elements in the array, so let's find the simple average grade (in other words, the mean grade value).

In [7]:
grades.mean()

49.18181818181818
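
##### The mean is only one of several aggregations available. As a sketch, the same grades can be summarized with a few other standard NumPy calls (min, max, np.median, and std):

```python
import numpy as np

grades = np.array([50, 50, 47, 97, 49, 3, 53, 42, 26, 74, 82,
                   62, 37, 15, 70, 27, 36, 35, 48, 52, 63, 64])

print('Min:', grades.min())
print('Max:', grades.max())
print('Median:', np.median(grades))
print('Standard deviation: {:.2f}'.format(grades.std()))
```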

##### Let's add a second set of data for the same students.

In [10]:
# Define an array of study hours
study_hours = [10.0,11.5,9.0,16.0,9.25,1.0,11.5,9.0,8.5,14.5,15.5,
            13.75,9.0,8.0,15.5,8.0,9.0,6.0,10.0,12.0,12.5,12.0]

# Create a 2D array (an array of arrays)
student_data = np.array([study_hours, grades])

# display the array
student_data

array([[10.  , 11.5 ,  9.  , 16.  ,  9.25,  1.  , 11.5 ,  9.  ,  8.5 ,
        14.5 , 15.5 , 13.75,  9.  ,  8.  , 15.5 ,  8.  ,  9.  ,  6.  ,
        10.  , 12.  , 12.5 , 12.  ],
       [50.  , 50.  , 47.  , 97.  , 49.  ,  3.  , 53.  , 42.  , 26.  ,
        74.  , 82.  , 62.  , 37.  , 15.  , 70.  , 27.  , 36.  , 35.  ,
        48.  , 52.  , 63.  , 64.  ]])

In [11]:
# Show shape of 2D array
student_data.shape

(2, 22)

##### To find the first value in the first array (which contains the study hours data), you can use the following code.

In [12]:
# Show the first element of the first element
student_data[0][0]

10.0
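
##### NumPy also accepts the row and column positions together inside one pair of brackets, which gives the same value as chaining two index operations. The sketch below uses a shortened copy of the data (three students only). It also shows why the grades appear as floats in the 2D array: every element in a NumPy array shares one data type.

```python
import numpy as np

student_data = np.array([[10.0, 11.5, 9.0],   # study hours
                         [50, 50, 47]])       # grades

chained = student_data[0][0]        # index the row, then the element
direct = student_data[0, 0]         # index row and column in one step
second_grade = student_data[1, 1]   # row 1 (grades), column 1

print(chained, direct, second_grade)
print(student_data.dtype)           # the integers were converted to floats
```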

In [13]:
# Get the mean value of each sub-array
avg_study = student_data[0].mean()
avg_grade = student_data[1].mean()

print('Average study hours: {:.2f}\nAverage grade: {:.2f}'.format(avg_study, avg_grade))

Average study hours: 10.52
Average grade: 49.18


##### The string passed to print() contains two placeholders ({:.2f}). Each one displays an inserted value as a floating-point number with two decimal places (.2f).
##### \n is the newline character, so the text is printed on two lines.
##### .format(avg_study, avg_grade) inserts the values into the placeholders in order: avg_study fills the first {:.2f} and avg_grade fills the second.
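
##### The same formatting can be written as an f-string, which places the variable name inside the placeholder. This is an alternative sketch, not what the cell above uses; the two values are illustrative stand-ins for the computed averages.

```python
avg_study = 10.5227   # illustrative value
avg_grade = 49.1818   # illustrative value

with_format = 'Average study hours: {:.2f}\nAverage grade: {:.2f}'.format(avg_study, avg_grade)
with_fstring = f'Average study hours: {avg_study:.2f}\nAverage grade: {avg_grade:.2f}'

print(with_fstring)
print(with_format == with_fstring)
```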

# Exploring Tabular Data With Pandas

##### NumPy provides a lot of the functionality and tools you need to work with numbers, such as arrays of numeric values. However, when you start to deal with two-dimensional tables of data, the Pandas package offers a more convenient structure to work with: the DataFrame. Run the following cell to import the Pandas library and create a DataFrame with three columns. The first column is a list of student names, and the second and third columns are the NumPy arrays containing the study time and grade data.

In [17]:
import pandas as pd
df_students = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic', 'Jimmie', 
                                    'Rhonda', 'Giovanni', 'Francesca', 'Rajab', 'Naiyana', 'Kian', 'Jenny',
                                    'Jakeem','Helena','Ismat','Anila','Skye','Daniel','Aisha'],
                            'StudyHours':student_data[0],
                            'Grade':student_data[1]})
df_students
# Note that in addition to the columns you specified, the DataFrame includes an index to uniquely identify each row. We could've specified the index explicitly and assigned any kind of appropriate value (for example, an email address). However, because we didn't specify an index, one has been created with a unique integer value for each row.

Unnamed: 0,Name,StudyHours,Grade
0,Dan,10.0,50.0
1,Joann,11.5,50.0
2,Pedro,9.0,47.0
3,Rosie,16.0,97.0
4,Ethan,9.25,49.0
5,Vicky,1.0,3.0
6,Frederic,11.5,53.0
7,Jimmie,9.0,42.0
8,Rhonda,8.5,26.0
9,Giovanni,14.5,74.0


##### To specify an index explicitly when creating the DataFrame, pass the index parameter to pd.DataFrame(). You can assign any appropriate values to the index, such as unique IDs or email addresses. For example, to use email addresses as the index, modify the code like this:

In [18]:
import pandas as pd
df_students = pd.DataFrame(
    {'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic', 'Jimmie', 
            'Rhonda', 'Giovanni', 'Francesca', 'Rajab', 'Naiyana', 'Kian', 'Jenny',
            'Jakeem', 'Helena', 'Ismat', 'Anila', 'Skye', 'Daniel', 'Aisha'],
    'StudyHours': student_data[0],
    'Grade': student_data[1]},
    index=['dan@example.com', 'joann@example.com', 'pedro@example.com', 'rosie@example.com',
           'ethan@example.com', 'vicky@example.com', 'frederic@example.com', 'jimmie@example.com',
           'rhonda@example.com', 'giovanni@example.com', 'francesca@example.com', 'rajab@example.com',
           'naiyana@example.com', 'kian@example.com', 'jenny@example.com', 'jakeem@example.com',
           'helena@example.com', 'ismat@example.com', 'anila@example.com', 'skye@example.com',
           'daniel@example.com', 'aisha@example.com']  # Explicit index (email addresses)
)

df_students


Unnamed: 0,Name,StudyHours,Grade
dan@example.com,Dan,10.0,50.0
joann@example.com,Joann,11.5,50.0
pedro@example.com,Pedro,9.0,47.0
rosie@example.com,Rosie,16.0,97.0
ethan@example.com,Ethan,9.25,49.0
vicky@example.com,Vicky,1.0,3.0
frederic@example.com,Frederic,11.5,53.0
jimmie@example.com,Jimmie,9.0,42.0
rhonda@example.com,Rhonda,8.5,26.0
giovanni@example.com,Giovanni,14.5,74.0


# Finding and filtering data in a DataFrame

##### You can use the DataFrame's loc indexer to retrieve data for a specific index label, like this:

In [22]:
df_students.loc["vicky@example.com"]

Name          Vicky
StudyHours      1.0
Grade           3.0
Name: vicky@example.com, dtype: object
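
##### The result of selecting one label is a pandas Series whose labels are the column names, so individual values can be read from it by column. The sketch below rebuilds a three-row version of the table (df is an illustrative name) and also shows loc taking a row label and a column label together.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Dan', 'Joann', 'Vicky'],
     'StudyHours': [10.0, 11.5, 1.0],
     'Grade': [50.0, 50.0, 3.0]},
    index=['dan@example.com', 'joann@example.com', 'vicky@example.com'])

row = df.loc['vicky@example.com']   # a Series labeled by column name
print(type(row).__name__)
print(row['Grade'])

# A single cell: row label first, then column label
hours = df.loc['vicky@example.com', 'StudyHours']
print(hours)
```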

##### You can also get the data at a range of index values, like this:

In [23]:
df_students.loc["dan@example.com":"jimmie@example.com"]

Unnamed: 0,Name,StudyHours,Grade
dan@example.com,Dan,10.0,50.0
joann@example.com,Joann,11.5,50.0
pedro@example.com,Pedro,9.0,47.0
rosie@example.com,Rosie,16.0,97.0
ethan@example.com,Ethan,9.25,49.0
vicky@example.com,Vicky,1.0,3.0
frederic@example.com,Frederic,11.5,53.0
jimmie@example.com,Jimmie,9.0,42.0


##### In addition to using loc to find rows based on index labels, you can use iloc to find rows based on their ordinal position in the DataFrame (regardless of the index):


In [27]:
# Get data in the first five rows
df_students.iloc[0:5]

Unnamed: 0,Name,StudyHours,Grade
dan@example.com,Dan,10.0,50.0
joann@example.com,Joann,11.5,50.0
pedro@example.com,Pedro,9.0,47.0
rosie@example.com,Rosie,16.0,97.0
ethan@example.com,Ethan,9.25,49.0


##### The loc slice returned every row from the label dan@example.com through the label jimmie@example.com, including both endpoints (eight rows). In contrast, iloc returns the rows at the positions in the range 0 to 5. Because integer ranges don't include the upper-bound value, this covers positions 0, 1, 2, 3, and 4 (five rows).
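
##### The difference in how the two slices treat their end point can be checked on a smaller frame. This is a sketch with four made-up rows; by_label and by_position are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'Grade': [50.0, 50.0, 47.0, 97.0]},
                  index=['dan@example.com', 'joann@example.com',
                         'pedro@example.com', 'rosie@example.com'])

by_label = df.loc['dan@example.com':'pedro@example.com']   # end label is included
by_position = df.iloc[0:2]                                  # end position is excluded

print(len(by_label), len(by_position))
```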

##### iloc identifies data values in a DataFrame by position, which extends beyond rows to columns. So, for example, you can use it to find the values for the columns in positions 0 and 2 in the row at position 4, like this:

In [29]:
df_students.iloc[4,[0,2]]

Name     Ethan
Grade     49.0
Name: ethan@example.com, dtype: object

In [34]:
df_students.loc["ethan@example.com", ["Name", "Grade"]]

Name     Ethan
Grade     49.0
Name: ethan@example.com, dtype: object
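
##### The two cells above select the same values, once by position with iloc and once by label with loc. As a sketch on a two-row copy of the table (so Ethan sits at position 1 here rather than 4), the results can be compared directly:

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Dan', 'Ethan'],
     'StudyHours': [10.0, 9.25],
     'Grade': [50.0, 49.0]},
    index=['dan@example.com', 'ethan@example.com'])

by_position = df.iloc[1, [0, 2]]                            # row 1, columns 0 and 2
by_label = df.loc['ethan@example.com', ['Name', 'Grade']]   # same cells, by label

print(by_position.tolist() == by_label.tolist())
```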

In [35]:
df_students["Name"] == "Ethan"

dan@example.com          False
joann@example.com        False
pedro@example.com        False
rosie@example.com        False
ethan@example.com         True
vicky@example.com        False
frederic@example.com     False
jimmie@example.com       False
rhonda@example.com       False
giovanni@example.com     False
francesca@example.com    False
rajab@example.com        False
naiyana@example.com      False
kian@example.com         False
jenny@example.com        False
jakeem@example.com       False
helena@example.com       False
ismat@example.com        False
anila@example.com        False
skye@example.com         False
daniel@example.com       False
aisha@example.com        False
Name: Name, dtype: bool

##### Here's another useful trick. You can find rows based on a filtering expression that references named columns other than the index, like this. (Remember that in our case, we explicitly defined the index as email addresses.)

In [39]:
df_students[df_students['Name']=='Ethan']

Unnamed: 0,Name,StudyHours,Grade
ethan@example.com,Ethan,9.25,49.0
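
##### The filtering expression is a Boolean Series like the one shown earlier, and the DataFrame keeps the rows where it is True. The sketch below, on a three-row copy of the table, also combines two conditions with &. The thresholds 50 and 12 are arbitrary values chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Dan', 'Rosie', 'Vicky'],
     'StudyHours': [10.0, 16.0, 1.0],
     'Grade': [50.0, 97.0, 3.0]},
    index=['dan@example.com', 'rosie@example.com', 'vicky@example.com'])

mask = df['Grade'] >= 50     # Boolean Series, one value per row
at_least_50 = df[mask]       # keeps the rows where the mask is True

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
both = df[(df['Grade'] >= 50) & (df['StudyHours'] > 12)]

print(at_least_50['Name'].tolist())
print(both['Name'].tolist())
```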


##### And for good measure, you can achieve the same results by using the DataFrame's query method, like this:

In [40]:
df_students.query("Name == 'Ethan'")

Unnamed: 0,Name,StudyHours,Grade
ethan@example.com,Ethan,9.25,49.0
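
##### A query string can also hold more than one condition, joined with and / or. This is a sketch on a three-row copy of the table; the conditions are arbitrary and chosen only to show the syntax.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Dan', 'Ethan', 'Vicky'],
     'Grade': [50.0, 49.0, 3.0]},
    index=['dan@example.com', 'ethan@example.com', 'vicky@example.com'])

# Column names are written bare; string values are quoted inside the query
by_query = df.query("Grade >= 49 and Name != 'Dan'")

# The equivalent Boolean-mask form
by_mask = df[(df['Grade'] >= 49) & (df['Name'] != 'Dan')]

print(by_query.index.tolist())
```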


In [41]:
df_students["Name"]

dan@example.com                Dan
joann@example.com            Joann
pedro@example.com            Pedro
rosie@example.com            Rosie
ethan@example.com            Ethan
vicky@example.com            Vicky
frederic@example.com      Frederic
jimmie@example.com          Jimmie
rhonda@example.com          Rhonda
giovanni@example.com      Giovanni
francesca@example.com    Francesca
rajab@example.com            Rajab
naiyana@example.com        Naiyana
kian@example.com              Kian
jenny@example.com            Jenny
jakeem@example.com          Jakeem
helena@example.com          Helena
ismat@example.com            Ismat
anila@example.com            Anila
skye@example.com              Skye
daniel@example.com          Daniel
aisha@example.com            Aisha
Name: Name, dtype: object

In [42]:
df_students.Name == "Ethan"

dan@example.com          False
joann@example.com        False
pedro@example.com        False
rosie@example.com        False
ethan@example.com         True
vicky@example.com        False
frederic@example.com     False
jimmie@example.com       False
rhonda@example.com       False
giovanni@example.com     False
francesca@example.com    False
rajab@example.com        False
naiyana@example.com      False
kian@example.com         False
jenny@example.com        False
jakeem@example.com       False
helena@example.com       False
ismat@example.com        False
anila@example.com        False
skye@example.com         False
daniel@example.com       False
aisha@example.com        False
Name: Name, dtype: bool

##### The above examples underline a confusing truth about working with Pandas. Often, there are multiple ways to achieve the same results. 
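
##### As a quick check of that point, the sketch below runs four of the approaches from this section on a three-row copy of the table and confirms that they select the same row. The names results and all_match are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Dan', 'Ethan', 'Vicky'],
     'Grade': [50.0, 49.0, 3.0]},
    index=['dan@example.com', 'ethan@example.com', 'vicky@example.com'])

results = [
    df[df['Name'] == 'Ethan'],       # bracket column access with a Boolean mask
    df[df.Name == 'Ethan'],          # attribute column access with a Boolean mask
    df.loc[df['Name'] == 'Ethan'],   # Boolean mask passed to loc
    df.query("Name == 'Ethan'"),     # query string
]

all_match = all(result.equals(results[0]) for result in results)
print(all_match)
```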