# **Fittlyf Data Science Assignment**
This Python notebook contains the seven assignments/tasks given during the Fittlyf interview process for the position of Data Science Intern.


## *Importing the dependencies:*

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
%matplotlib inline

In [2]:
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

## *Reading the data:*

Taking the given data "unemployment_analysis.csv" as input, as a pandas dataframe:

In [3]:
df = pd.read_csv('/content/unemployment_analysis.csv')

Converting strings into floats (6.8 lacs to 6.8) for all the columns containing years: 

*Here, we define a function 'convert_to_float' which takes a string as input and splits it on whitespace. It then converts the first part of the string into a float and returns it. If the value cannot be converted, it is returned unchanged.*

In [4]:
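def convert_to_float(year_str):
    # Keep the leading number of a string such as "6.8 lacs"
    try:
        return float(year_str.split()[0])
    # Non-strings, empty strings and non-numeric text are returned unchanged
    except (AttributeError, IndexError, ValueError):
        return year_str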

*Now, for every column where the column name is numeric, we apply the 'convert_to_float' function.*

In [5]:
for column in df.columns:
    if column.isnumeric():
        df[column] = df[column].apply(convert_to_float)
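
*As an alternative to the element-wise `apply`, the same conversion can be sketched with pandas string methods. This is only an illustration: the `toy` frame and its values are made up, and it assumes every year column holds strings of the form "number unit" or missing values. Unlike 'convert_to_float', which returns unparseable values unchanged, `errors="coerce"` turns them into NaN.*

```python
import pandas as pd

# Hypothetical stand-in for the real CSV; values and the "lacs" suffix are invented
toy = pd.DataFrame({
    "Country Name": ["A", "B"],
    "1991": ["6.8 lacs", "7.1 lacs"],
    "1992": ["5.0 lacs", None],
})

# Year columns are the ones whose name is purely numeric
year_cols = [c for c in toy.columns if c.isnumeric()]

for col in year_cols:
    # Take the text before the first whitespace, then convert it to float;
    # anything that cannot be parsed becomes NaN
    first_token = toy[col].str.split().str[0]
    toy[col] = pd.to_numeric(first_token, errors="coerce")

print(toy.dtypes)
```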

*Below is the dataframe where all the year-strings have been converted into float values.*

In [8]:
df

Unnamed: 0,Country Name,Country Code,1991,1992,1993,1994,1995,1996,1997,1998,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Africa Eastern and Southern,AFE,7.80,7.84,7.85,7.84,7.83,7.84,7.86,7.81,...,6.56,6.45,6.41,6.49,6.61,6.71,6.73,6.91,7.56,8.11
1,Afghanistan,AFG,10.65,10.82,10.72,10.73,11.18,10.96,10.78,10.80,...,11.34,11.19,11.14,11.13,11.16,11.18,11.15,11.22,11.71,13.28
2,Africa Western and Central,AFW,4.42,4.53,4.55,4.54,4.53,4.57,4.60,4.66,...,4.64,4.41,4.69,4.63,5.57,6.02,6.04,6.06,6.77,6.84
3,Angola,AGO,4.21,4.21,4.23,4.16,4.11,4.10,4.09,4.07,...,7.35,7.37,7.37,7.39,7.41,7.41,7.42,7.42,8.33,8.53
4,Albania,ALB,10.31,30.01,25.26,20.84,14.61,13.93,16.88,20.05,...,13.38,15.87,18.05,17.19,15.42,13.62,12.30,11.47,13.33,11.82
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
230,Samoa,WSM,2.10,2.38,2.63,3.04,3.19,3.47,3.90,4.18,...,8.75,8.67,8.72,8.50,8.31,8.58,8.69,8.41,9.15,9.84
231,"Yemen, Rep.",YEM,8.32,8.31,8.35,8.34,8.96,9.59,10.20,10.81,...,13.17,13.27,13.47,13.77,13.43,13.30,13.15,13.06,13.39,13.57
232,South Africa,ZAF,29.95,29.98,29.92,29.89,29.89,29.87,29.91,29.95,...,24.73,24.56,24.89,25.15,26.54,27.04,26.91,28.47,29.22,33.56
233,Zambia,ZMB,18.90,19.37,19.70,18.43,16.81,15.30,13.64,12.00,...,7.85,8.61,9.36,10.13,10.87,11.63,12.01,12.52,12.85,13.03


Displaying the column names with their datatypes: 

In [7]:
print(df.dtypes)

Country Name     object
Country Code     object
1991            float64
1992            float64
1993            float64
1994            float64
1995            float64
1996            float64
1997            float64
1998            float64
1999            float64
2000            float64
2001            float64
2002            float64
2003            float64
2004            float64
2005            float64
2006            float64
2007            float64
2008            float64
2009            float64
2010            float64
2011            float64
2012            float64
2013            float64
2014            float64
2015            float64
2016            float64
2017            float64
2018            float64
2019            float64
2020            float64
2021            float64
dtype: object


Generating the unique country names along with their corresponding country code: 

*Here, we create a new pandas dataframe named 'country_df' which contains only two columns, country name and country code.*

In [9]:
country_df = df[['Country Name', 'Country Code']]

*We drop the duplicate rows so that only unique country name/code pairs remain.*

In [11]:
country_df = country_df.drop_duplicates()
country_df

Unnamed: 0,Country Name,Country Code
0,Africa Eastern and Southern,AFE
1,Afghanistan,AFG
2,Africa Western and Central,AFW
3,Angola,AGO
4,Albania,ALB
...,...,...
230,Samoa,WSM
231,"Yemen, Rep.",YEM
232,South Africa,ZAF
233,Zambia,ZMB


Finding the total number of unemployed people across all the years, for each country: 

*The function 'row_sum' selects the year columns (those with numeric names) and returns the sum of their values for the given row, along with the corresponding country name.*

In [18]:
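def row_sum(row):
    # Year columns are the ones whose name is purely numeric
    num_cols = [col for col in df.columns if col.isnumeric()]
    # Use a distinct name so the local value does not shadow the function
    total = row[num_cols].sum()
    return pd.Series({'Country Name': row['Country Name'], 'Unemployed People': total})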

*We apply 'row_sum' to every row, store the result in 'unemployment_sum_df', and then display it.*

In [20]:
unemployment_sum_df = df.apply(row_sum, axis=1)
unemployment_sum_df
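
*The same totals can also be computed without a row-wise `apply`, by summing the year columns along `axis=1`. The sketch below uses a made-up `toy` frame rather than the real data. Note that `sum` skips NaN values by default, so rows with missing years still get a total.*

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe
toy = pd.DataFrame({
    "Country Name": ["A", "B"],
    "Country Code": ["AAA", "BBB"],
    "1991": [1.0, 2.0],
    "1992": [3.0, 4.5],
})

# Year columns are the ones whose name is purely numeric
year_cols = [c for c in toy.columns if c.isnumeric()]

# Sum across the year columns for each row, keeping the country name alongside
totals = toy[["Country Name"]].copy()
totals["Unemployed People"] = toy[year_cols].sum(axis=1)

print(totals)
```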

Unnamed: 0,Country Name,Unemployed People
0,Africa Eastern and Southern,224.57
1,Afghanistan,345.38
2,Africa Western and Central,153.27
3,Angola,169.02
4,Albania,505.86
...,...,...
230,Samoa,180.98
231,"Yemen, Rep.",365.60
232,South Africa,875.21
233,Zambia,407.96


## *Data Cleaning:*

### **Task 1:**
Write a function in python called data_cleaning() which, when called, would perform the following activity:
1. Find any missing values in all the columns and display them.
2. If any missing value exists, then replace them with the average of the corresponding country. Then, again, check for null values.
3. For the countries ‘Benin’, ‘Bahrain’, find if any outliers exist. If yes, replace them with mean/median/mode.
4. Create a new column, named, “Year”, which would have all the years as per each country & beside that column, add a new one named, “No. of unemployed”, which would have the corresponding total values.
5. Change the column name “Country Name” to “Country_name” & “Country Code” to “Country_code”.
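
Steps 3 to 5 above could be sketched roughly as follows. This is an illustration on made-up data, not the notebook's actual solution: the helper `replace_outliers_iqr`, the 1.5 × IQR rule, and the choice of the median as the replacement value are all assumptions, since the task leaves the outlier definition and the replacement statistic open.

```python
import pandas as pd

def replace_outliers_iqr(values: pd.Series) -> pd.Series:
    """Replace values outside 1.5 * IQR of the quartiles with the median (assumed rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return values.mask(is_outlier, values.median())

# Hypothetical values; 9.0 is a planted outlier
toy = pd.DataFrame({
    "Country Name": ["Benin", "Bahrain"],
    "Country Code": ["BEN", "BHR"],
    "1991": [1.0, 1.1],
    "1992": [1.2, 1.0],
    "1993": [1.1, 1.2],
    "1994": [1.0, 1.1],
    "1995": [9.0, 1.0],
})
year_cols = [c for c in toy.columns if c.isnumeric()]

# Step 3: handle each country's row separately
toy[year_cols] = toy[year_cols].apply(replace_outliers_iqr, axis=1)

# Step 4: reshape from one column per year to one row per (country, year)
long_df = toy.melt(
    id_vars=["Country Name", "Country Code"],
    value_vars=year_cols,
    var_name="Year",
    value_name="No. of unemployed",
)

# Step 5: rename the identifier columns
long_df = long_df.rename(
    columns={"Country Name": "Country_name", "Country Code": "Country_code"}
)

print(long_df.head())
```

After `melt`, the "Year" column holds the former column names as strings; converting it with `astype(int)` is an optional extra step.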

In [61]:
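def data_cleaning(df):
    # Step 1: count missing values per column and show only the affected columns
    missing_values = df.isnull().sum()
    missing_cols = missing_values[missing_values > 0]
    print("Missing Values:\n", missing_cols)
    # Return the dataframe so the caller's assignment receives it instead of None
    return df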

In [62]:
data_cleaned = data_cleaning(df)

Missing Values:
 2000    1
2015    1
dtype: int64
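
The output above reports one missing value each in the 2000 and 2015 columns. Step 2 of the task, replacing a missing value with the average of the corresponding country, could be sketched as below. The `toy` frame is made up, and "average of the country" is read here as the mean of that country's row across all year columns, which is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with one missing value
toy = pd.DataFrame({
    "Country Name": ["A", "B"],
    "1999": [2.0, 5.0],
    "2000": [np.nan, 6.0],
    "2001": [4.0, 7.0],
})
year_cols = [c for c in toy.columns if c.isnumeric()]

# Mean over the year columns for each country (NaN values are skipped)
row_means = toy[year_cols].mean(axis=1)

# fillna aligns on the row index, so each gap gets its own country's mean
toy[year_cols] = toy[year_cols].apply(lambda col: col.fillna(row_means))

# Re-check for null values
print(toy.isnull().sum())
```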
