<a href="https://colab.research.google.com/github/StanStarishko/python-programming-for-data/blob/main/Worksheets/2_describing_and_interrogating_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Describing and Interrogating Data
---

When using pandas we first need to import it

` import pandas as pd `


When first looking at a dataset, it is important to be able to see information about the data, such as summary statistics, and to interrogate it to find specific information.

Describing data in this way carries the least risk of bias and inaccurate conclusions, because it focuses only on the data that is presented to us.

### Summary Statistics
---

Mean - the average  
Median - middle value / 50% of the data (another type of average)  
Mode - the value that appears the most frequently  
Range - the total range of values (max - min)    

**Functions:**

`mean()` - mean (average)   
`mode()` - mode    
`std()` - standard deviation   
`min()` - minimum value of column     
`max()` - maximum value of column  
`median()` - middle value (median)
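
As a rough illustration of the functions listed above, the sketch below applies them to a single column. The `scores` dataframe and its `score` column are made-up toy data, not part of the worksheet's datasets.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
scores = pd.DataFrame({"score": [2, 4, 4, 5, 10]})
col = scores["score"]

mean_score = col.mean()              # arithmetic average
median_score = col.median()          # middle value once sorted
mode_score = col.mode()[0]           # mode() returns a Series, so take the first entry
score_range = col.max() - col.min()  # range = max - min
std_score = col.std()                # sample standard deviation (pandas default)

print(mean_score, median_score, mode_score, score_range, std_score)
```

Note that `mode()` returns a Series because several values can tie for most frequent, which is why the first entry is picked out here.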

### Useful Functions
---
`head()` will show the first 5 rows of the dataframe. You can show a different number of rows by putting the number of rows you would like to see in the brackets.    
`tail()` same as head() but for the last 5 rows   
`info()` will show information about the overall dataset, including how many non-null values exist in each column, the data type of each column and the dataframe's length   
`describe()` will show summary statistics for all numeric columns   
`iloc[index]`  will show you row / rows at index position or in index range  
`unique()` will show all the unique values in a column   
`nunique()` will show the number of unique values in a column   
`len()` will show the length (can be used on a list, array, column etc)   
`shape` will show the number of rows and number of columns _e.g. (100, 15)_. It is an attribute rather than a function, so it is written without brackets.
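
The sketch below tries each of these on a small made-up dataframe. The `pets` dataframe and its column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data with one missing value in the "pet" column
pets = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee", "Eve", "Flo"],
    "pet": ["cat", "dog", "cat", None, "fish", "dog"],
    "age": [3, 5, 2, 7, 1, 4],
})

first_two = pets.head(2)         # first 2 rows instead of the default 5
last_row = pets.tail(1)          # last row only
pets.info()                      # prints its report directly and returns None
summary = pets.describe()        # statistics for numeric columns only ("age")
third_row = pets.iloc[2]         # single row at position 2
middle_rows = pets.iloc[1:4]     # rows at positions 1, 2 and 3
kinds = pets["pet"].unique()     # distinct values, the missing value included
n_kinds = pets["pet"].nunique()  # count of distinct values, missing values excluded
n_rows = len(pets)               # number of rows
dims = pets.shape                # (rows, columns); an attribute, so no brackets
```

Because `info()` prints its own output and returns `None`, wrapping it in `print()` or `display()` adds an extra `None` line.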

### Interrogating dataframes
---

To view subsets:

* single column: `dataframe['column'] `
* multiple columns: `dataframe[['column1', 'column2']]`
* rows by criteria: `dataframe[dataframe['column'] == 'criteria']`
* multiple conditions (each condition in its own brackets, joined with `&` for "and" or `|` for "or")   
`dataframe[(dataframe['column1'] == condition1) & (dataframe['column2'] == condition2)]`
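
A minimal sketch of these selections, assuming a made-up `people` dataframe (the names and columns are hypothetical):

```python
import pandas as pd

# Hypothetical toy data
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "city": ["Leeds", "York", "Leeds", "Hull"],
    "age": [34, 19, 52, 41],
})

names = people["name"]                        # single column -> Series
name_age = people[["name", "age"]]            # list of columns -> DataFrame
in_leeds = people[people["city"] == "Leeds"]  # rows matching one condition

# Each condition needs its own brackets because & binds tighter than == and >
leeds_over_40 = people[(people["city"] == "Leeds") & (people["age"] > 40)]

print(leeds_over_40)
```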


## Data Retrieval
---

In order to load in a dataset you will need to retrieve it. The following code retrieves different types of data.

From a webpage (`read_html` returns a list of dataframes, one for each table found on the page):

` pd.read_html("url")`

From a CSV hosted on Github:

`pd.read_csv("url")`

From an Excel hosted on Github:

`pd.read_excel("url", sheet_name = "sheet name")`
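
To try `read_csv` without a network connection, the sketch below reads CSV text from an in-memory buffer instead of a URL. The `csv_text` contents are made up; with a real dataset you would pass the URL string in the same position.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file hosted online
csv_text = "name,age\nAnn,34\nBob,19\n"

# read_csv accepts a URL, a file path or a file-like object
people = pd.read_csv(io.StringIO(csv_text))

print(people.shape)
print(people.dtypes)
```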




### Exercise 1 - open the Titanic dataset and see descriptive info
---

The Titanic dataset is stored at this URL:
https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv

1. Read the dataset into a pandas dataframe that you will call **titanic**.


2. Write a function called **summary** that will:
* Display the first 5 rows of the dataset
* Use info() to display a technical summary of the data
* Use describe() to display a numerical summary of the data


**Expected Output**

```
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
       PassengerId    Survived      Pclass  ...       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  ...  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642  ...    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071  ...    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000  ...    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000  ...    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000  ...    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000  ...    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000  ...    8.000000    6.000000  512.329200

[8 rows x 7 columns]
```

In [1]:
import pandas as pd

def is_valid_link(link="",link_name="",autotest=False):
  # a link is valid only if it is a non-empty string
  return_value = link != "" and isinstance(link, str)

  if not return_value and not autotest: # stay silent when running as an automated test
    print(f"{link_name} is not valid")

  return return_value


def get_data_frame_size_info(df=None,df_name="Data Frame",autotest=False):

  if not isinstance(df, pd.DataFrame):
    if not autotest:
      print(f"{df_name} is not a pandas DataFrame type")
    return False

  num_rows, num_columns = df.shape
  size_info = f"[{num_rows} rows x {num_columns} columns]"
  return size_info


def get_summary(url=""):
  # add code below which prints the first 5 rows of the dataset, the info and the numerical summary

  if not is_valid_link(url,"url"):
    return False

  titanic = pd.read_csv(url)

  # get first 5 rows
  titanic_head = titanic.head()

  # get first 3 columns; copy so that adding a column does not write to a slice of titanic_head
  display_left_head = titanic_head.iloc[:, :3].copy()
  display_left_head["..."] = " ... "

  # get last 3 col
  display_right_head = titanic_head.iloc[:, -3:]

  # get together dataframes for display
  display_head = pd.concat([display_left_head, display_right_head], axis=1)
  print("Head:")
  display(display_head)
  print(get_data_frame_size_info(titanic_head,"display head"))

  print("\n\nInfo:")
  display(titanic.info())

  print("\n\nDescribe:")
  describe_df = titanic.describe()
  display(describe_df)
  print(get_data_frame_size_info(describe_df,"describe data"))



#titanic = #read your dataset into this variable (don't forget to import pandas first)
url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv"

# run and visually test using example above
get_summary(url)



Head:


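| | PassengerId | Survived | Pclass | ... | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | ... | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | ... | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | ... | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | ... | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | ... | 8.05 | NaN | S |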


[5 rows x 12 columns]


Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


None



Describe:


| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


[8 rows x 7 columns]


### Exercise 2 - displaying other statistics
---

Take a look at the list of methods available for giving summary statistics [here](https://pandas.pydata.org/docs/user_guide/basics.html#basics-stats).  (You will need to use `.mode()` in this exercise)

Use pandas functions, and your existing knowledge, to display the following summary statistics from the Titanic dataset:

Write a function called **get_statistics()** which returns:

1.  The total number of passengers on the titanic
2.  The age of the youngest passenger
3.  The most expensive ticket price
4.  The range of ticket prices
5.  The number of passengers with cabins
6.  The code for the port where the highest number of passengers embarked
7.  The most populous gender
8.  The standard deviation for age

In [13]:
def get_statistics(url=""):
  # add code below to return the stats listed above

  if not is_valid_link(url,"url"):
    return False

  #passengers, youngest, most_expensive, range_ticket, no_cabins, embarked[0], gender[0], sd
  passengers_df = pd.read_csv(url)

  # 1. The total number of passengers on the titanic
  passengers = len(passengers_df)

  # 2. The age of the youngest passenger
  youngest = passengers_df['Age'].min()

  # 3. The most expensive ticket price
  most_expensive = passengers_df['Fare'].max()

  # 4. The range of ticket prices
  range_ticket = passengers_df['Fare'].max() - passengers_df['Fare'].min()

  # 5. The number of passengers with cabins
  no_cabins = passengers_df['Cabin'].notna().sum()

  # 6. The code for the port where the highest number of passengers embarked
  embarked = passengers_df['Embarked'].mode()

  # 7. The most populous gender
  gender = passengers_df['Sex'].mode()

  # 8. The standard deviation for age
  sd = passengers_df['Age'].std()


  return passengers, youngest, most_expensive, range_ticket, no_cabins, embarked[0], gender[0], sd


# This will run and test your function to see if answers are correct
actual = get_statistics(url)
expected = (891, 0.42, 512.3292, 512.3292, 204, 'S', 'male', 14.526497332334042)

if actual == expected:
  print("Test passed", actual)
else:
  print("Test failed, expected", expected, "but got", actual)


Test passed (891, 0.42, 512.3292, 512.3292, 204, 'S', 'male', 14.526497332334042)


### Exercise 3 - aggregating statistics grouped by category
---

Refer again to the tutorial  
[How to calculate summary statistics?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html#)   
looking particularly at the section on Aggregating statistics grouped by category.
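
As a hedged sketch of the grouping pattern that tutorial section describes, the example below groups a made-up `sales` dataframe (all names and values are hypothetical) by one column, by two columns, and sums a 0/1 flag per group.

```python
import pandas as pd

# Hypothetical toy data
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "channel": ["web", "shop", "web", "web", "shop"],
    "amount": [10.0, 20.0, 30.0, 50.0, 40.0],
    "returned": [0, 1, 1, 1, 0],
})

# Mean of one column for each category
mean_by_region = sales.groupby("region")["amount"].mean()

# Mean for each combination of two categories (result has a two-level index)
mean_by_region_channel = sales.groupby(["region", "channel"])["amount"].mean()

# Summing a 0/1 column counts the rows flagged 1 in each group
returns_by_region = sales.groupby("region")["returned"].sum()

print(mean_by_region)
print(mean_by_region_channel)
print(returns_by_region)
```

Selecting the column with single brackets, as here, gives a Series per result; selecting with double brackets before grouping gives a DataFrame, which prints with a column header instead of a `Name:` footer.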

Write a function called **grouped** which displays:

1.  The mean age for male versus female Titanic passengers?
2.  The mean ticket fare price for each of the sex and cabin class combinations?
3.  The mean ticket fare price for passengers who embarked at each port?
4.  Which passenger class had the highest number of survivors (for now, just show the statistics - how many survivors in each class - you can identify the class visually)

**Expected output**
```
            Age
 Sex              
 female  27.915709
 male    30.726645,

 Pclass  Sex   
 1       female    106.125798
         male       67.226127
 2       female     21.970121
         male       19.741782
 3       female     16.118810
         male       12.661633

 Name: Fare, dtype: float64

 Embarked
 C    59.954144
 Q    13.276030
 S    27.079812
 Name: Fare, dtype: float64

       Survived
 Pclass
 1          136
 2           87
 3          119
 ```

In [26]:
def get_grouped(url=""):
  # add code below to return the above stats

  if not is_valid_link(url,"url"):
    return False

  #get data frame
  df = pd.read_csv(url)

  # 1. The mean age for male versus female Titanic passengers?
  print("Mean age for male versus female Titanic passengers:")
  mean_age = df.groupby("Sex")["Age"].mean()
  print(mean_age)

  # 2. The mean ticket fare price for each of the sex and cabin class combinations?
  print("\n\nMean ticket fare price for each of the sex and cabin class combinations:")
  mean_fare = df.groupby(["Pclass", "Sex"])["Fare"].mean()
  print(mean_fare)

  # 3. The mean ticket fare price for passengers who embarked at each port?
  print("\n\nMean ticket fare price for passengers who embarked at each port:")
  mean_fare_port = df.groupby("Embarked")["Fare"].mean()
  print(mean_fare_port)

  # 4. Which passenger class had the highest number of survivors
  #   (for now, just show the statistics
  #    - how many survivors in each class
  #    - you can identify the class visually)
  print("\n\nWhich passenger class had the highest number of survivors:")
  survivors = df.groupby("Pclass")["Survived"].sum()
  print(survivors)


# run and test visually using the above expected output
get_grouped(url)



Mean age for male versus female Titanic passengers:
Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64


Mean ticket fare price for each of the sex and cabin class combinations:
Pclass  Sex   
1       female    106.125798
        male       67.226127
2       female     21.970121
        male       19.741782
3       female     16.118810
        male       12.661633
Name: Fare, dtype: float64


Mean ticket fare price for passengers who embarked at each port:
Embarked
C    59.954144
Q    13.276030
S    27.079812
Name: Fare, dtype: float64


Which passenger class had the highest number of survivors:
Pclass
1    136
2     87
3    119
Name: Survived, dtype: int64


### Exercise 4 - using iloc
---
Write a function called **get_middle** to:
*   display the middle 20 records (use the shape of the dataframe to help you identify the index positions of these)
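
When the number of rows is odd, the middle position depends on how the half is rounded. The sketch below shows the difference on a made-up 11-row dataframe (`toy` and `n` are hypothetical names), taking the middle 4 rows each way.

```python
import pandas as pd

# Hypothetical toy data: 11 rows with index 0..10
toy = pd.DataFrame({"value": range(11)})
n = 4  # how many middle rows to take

floor_middle = len(toy) // 2          # 11 // 2 rounds down
rounded_middle = round(len(toy) / 2)  # Python rounds exact halves to the nearest even number

# Take n // 2 rows either side of the middle position
floor_slice = toy.iloc[floor_middle - n // 2: floor_middle + n // 2]
rounded_slice = toy.iloc[rounded_middle - n // 2: rounded_middle + n // 2]

print(floor_slice.index[0], rounded_slice.index[0])
```

The two slices have the same length but start one row apart, which is the kind of off-by-one difference to look for if a test on the starting index fails.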


In [37]:
def get_middle(url=""):
  # add code below to return middle 20 records

  if not is_valid_link(url,"url"):
    return False

  #get data frame
  df = pd.read_csv(url)

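  # get the middle record position; for 891 rows round() gives 446, floor division would give 445
  middle_idx = round(len(df) / 2)

  # get middle 20 records: 10 either side of the middle position
  middle_df = df.iloc[middle_idx - 10: middle_idx + 10]
  display(middle_df)

  return middle_df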


# run and test if your returned 20 records starts at correct index

actual = get_middle(url).index[0]
expected = 436

if actual == expected:
  print("Test passed", actual)
else:
  print("Test failed expected index of", expected, "got", actual)

# If you are failing the test by 1 out, why might this be? Think about what happens when you use floor division vs the round() function



| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 436 | 437 | 0 | 3 | Ford, Miss Doolina Margaret "Daisy" | female | 21.0 | 2 | 2 | W./C. 6608 | 34.375 | NaN | S |
| 437 | 438 | 1 | 2 | Richards, Mrs. Sidney (Emily Hocking) | female | 24.0 | 2 | 3 | 29106 | 18.75 | NaN | S |
| 438 | 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0 | C23 C25 C27 | S |
| 439 | 440 | 0 | 2 | Kvillner, Mr. Johan Henrik Johannesson | male | 31.0 | 0 | 0 | C.A. 18723 | 10.5 | NaN | S |
| 440 | 441 | 1 | 2 | Hart, Mrs. Benjamin (Esther Ada Bloomfield) | female | 45.0 | 1 | 1 | F.C.C. 13529 | 26.25 | NaN | S |
| 441 | 442 | 0 | 3 | Hampe, Mr. Leon | male | 20.0 | 0 | 0 | 345769 | 9.5 | NaN | S |
| 442 | 443 | 0 | 3 | Petterson, Mr. Johan Emil | male | 25.0 | 1 | 0 | 347076 | 7.775 | NaN | S |
| 443 | 444 | 1 | 2 | Reynaldo, Ms. Encarnacion | female | 28.0 | 0 | 0 | 230434 | 13.0 | NaN | S |
| 444 | 445 | 1 | 3 | Johannesen-Bratthammer, Mr. Bernt | male | NaN | 0 | 0 | 65306 | 8.1125 | NaN | S |
| 445 | 446 | 1 | 1 | Dodge, Master Washington | male | 4.0 | 0 | 2 | 33638 | 81.8583 | A34 | S |


Test passed 436


### Exercise 5 - migration to and from
---

The Excel file at this link (which you have already opened above): https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true has three data sheets, "Country Migration", "Industry Migration" and "Skill Migration"

Read the data sheet "Country Migration" into a variable called **country**

Write a function called **get_uk_mig** that will return all the rows which had migration to the United Kingdom

In [38]:
def is_valid_link(link="",link_name="",autotest=False):
  # a link is valid only if it is a non-empty string
  return_value = link != "" and isinstance(link, str)

  if not return_value and not autotest: # stay silent when running as an automated test
    print(f"{link_name} is not valid")

  return return_value

def get_excel_data(url="",sheet_name="default"):
  # both the url and the sheet name must be non-empty strings
  is_not_valid_url = not is_valid_link(url,"url")
  is_not_valid_sheet_name = not is_valid_link(sheet_name,"sheet name")
  if is_not_valid_url or is_not_valid_sheet_name:
    return False

  if sheet_name == "default":
    # no sheet requested, so pandas reads the first sheet
    df = pd.read_excel(url)
  else:
    df = pd.read_excel(url, sheet_name=sheet_name)

  return df


def get_uk_mig(url=""):
  # add code below to return all rows which had migration to the UK

  if not is_valid_link(url,"url"):
    return False

  #get data frame
  country = get_excel_data(url,sheet_name="Country Migration")

  #return all the rows which had migration to the United Kingdom
  return_df = country[country["target_country_name"] == "United Kingdom"]
  display(return_df)

  return return_df





# run and test if your returned dataframe is the correct length
url = "https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true"

actual = len(get_uk_mig(url))
expected = 122

if actual == expected:
  print("Test passed", actual)
else:
  print("Test failed expected", expected, "got", actual)



| | base_country_code | base_country_name | base_lat | base_long | base_country_wb_income | base_country_wb_region | target_country_code | target_country_name | target_lat | target_long | target_country_wb_income | target_country_wb_region | net_per_10K_2015 | net_per_10K_2016 | net_per_10K_2017 | net_per_10K_2018 | net_per_10K_2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 85 | ae | United Arab Emirates | 23.424076 | 53.847818 | High Income | Middle East & North Africa | gb | United Kingdom | 55.378051 | -3.435973 | High Income | Europe & Central Asia | 12.41 | 4.45 | 0.84 | -1.78 | -4.03 |
| 112 | af | Afghanistan | 33.939110 | 67.709953 | Low Income | South Asia | gb | United Kingdom | 55.378051 | -3.435973 | High Income | Europe & Central Asia | -1.57 | -1.32 | -1.20 | 0.94 | 0.89 |
| 126 | al | Albania | 41.153332 | 20.168331 | Upper Middle Income | Europe & Central Asia | gb | United Kingdom | 55.378051 | -3.435973 | High Income | Europe & Central Asia | -1.89 | -3.61 | -1.71 | -2.99 | -4.41 |
| 133 | am | Armenia | 40.069099 | 45.038189 | Upper Middle Income | Europe & Central Asia | gb | United Kingdom | 55.378051 | -3.435973 | High Income | Europe & Central Asia | 0.00 | -0.25 | -0.86 | -0.12 | -0.63 |
| 148 | ao | Angola | -11.202692 | 17.873887 | Lower Middle Income | Sub-Saharan Africa | gb | United Kingdom | 55.378051 | -3.435973 | High Income | Europe & Central Asia | 5.20 | 1.29 | -0.64 | 0.10 | 2.90 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4026 | ve | Venezuela, RB | 6.423750 | -66.589730 | Upper Middle Income | Latin America & Caribbean | gb | United Kingdom | 55.378051 | -3.435973 | High Income | Europe & Central Asia | -0.89 | -0.99 | -1.17 | -0.81 | -0.78 |
| 4057 | vn | Vietnam | 14.058324 | 108.277199 | Lower Middle Income | East Asia & Pacific | gb | United Kingdom | 55.378051 | -3.435973 | High Income | Europe & Central Asia | 1.32 | 1.35 | -0.16 | 0.42 | 0.10 |
| 4121 | za | South Africa | -30.559482 | 22.937506 | Upper Middle Income | Sub-Saharan Africa | gb | United Kingdom | 55.378051 | -3.435973 | High Income | Europe & Central Asia | 14.26 | 6.48 | 0.98 | -0.42 | -3.71 |
| 4132 | zm | Zambia | -13.133897 | 27.849332 | Lower Middle Income | Sub-Saharan Africa | gb | United Kingdom | 55.378051 | -3.435973 | High Income | Europe & Central Asia | 43.27 | 27.60 | 7.88 | 6.90 | 3.68 |


Test passed 122


### Exercise 6 - how many countries are migrated from

Using the "Country Migration" sheet again, get the total number of unique country names of where people have migrated from.


In [44]:
def migration(url=""):
  #add code below to return the total number of unique country names of where people have migrated from

  if not is_valid_link(url,"url"):
    return False

  #get data frame
  migrations = get_excel_data(url,sheet_name="Country Migration")

  #get unique country names of where people have migrated from
  return len(migrations["base_country_name"].unique())


# run and test if you have the correct number of unique countries
actual = migration(url)
expected = 140

if actual == expected:
  print("Test passed", actual)
else:
  print("Test failed expected", expected, "got", actual)


Test passed 140


# Reflection
----

## What skills have you demonstrated in completing this notebook?

Your answer:

- Rapid new skills development
- Open to new ideas
- Analytical and Outside The Box Thinking
- Passionate about software
- Crisis Resolution
- Attention to Detail
- Agile




## What caused you the most difficulty?

Your answer:

Exercise #2. Reproducing the display of the collected data exactly as shown in "Expected Output". I still haven't managed to match it perfectly.

The result looks very similar, but it is not an exact match.