## 1-2. Download the dataset from this URL
Load it into a DataFrame.

In [3]:
import pandas as pd

df = pd.read_csv(r"C:\Users\User\OneDrive\Desktop\Data-Analysis_Practice\car_insurance.csv")

In [2]:
df

Unnamed: 0,Id,Age,Job,Marital,Education,Default,Balance,HHInsurance,CarLoan,Communication,LastContactDay,LastContactMonth,NoOfContacts,DaysPassed,PrevAttempts,Outcome,CallStart,CallEnd,CarInsurance
0,,,,,,,,,,,,,,,,,,,
1,1.0,32.0,management,single,tertiary,0.0,1218.0,1.0,0.0,telephone,28.0,jan,2.0,-1.0,0.0,,13:45:20,13:46:30,0.0
2,,,,,,,,,,,,,,,,,,,
3,2.0,32.0,blue-collar,married,primary,0.0,1156.0,1.0,0.0,,26.0,may,5.0,-1.0,0.0,,14:49:03,14:52:08,0.0
4,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7995,3998.0,27.0,admin.,single,secondary,0.0,-400.0,0.0,1.0,cellular,8.0,jul,1.0,-1.0,0.0,,12:19:03,12:23:53,0.0
7996,,,,,,,,,,,,,,,,,,,
7997,3999.0,36.0,entrepreneur,single,tertiary,0.0,658.0,1.0,0.0,cellular,29.0,jan,1.0,227.0,3.0,failure,11:27:35,11:29:14,0.0
7998,,,,,,,,,,,,,,,,,,,


# 3. Observe the names of the columns and their corresponding data types

The info() method provides a concise summary of the DataFrame, including the column names, data types, and the number of non-null values.

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Id                4000 non-null   float64
 1   Age               4000 non-null   float64
 2   Job               3981 non-null   object 
 3   Marital           4000 non-null   object 
 4   Education         3831 non-null   object 
 5   Default           4000 non-null   float64
 6   Balance           4000 non-null   float64
 7   HHInsurance       4000 non-null   float64
 8   CarLoan           4000 non-null   float64
 9   Communication     3098 non-null   object 
 10  LastContactDay    4000 non-null   float64
 11  LastContactMonth  4000 non-null   object 
 12  NoOfContacts      4000 non-null   float64
 13  DaysPassed        4000 non-null   float64
 14  PrevAttempts      4000 non-null   float64
 15  Outcome           958 non-null    object 
 16  CallStart         4000 non-null   object 


# Observations

- Column data types: there are 19 columns in total. 11 columns are of type float64 and 8 columns are of type object (usually strings or categorical data).
- Most columns have 4000 non-null values out of 8000 entries, so the other 4000 entries are missing for these columns. This suggests the data might be split into two different sets, or that two different sources were merged into one DataFrame.
- Job (3981), Education (3831), Communication (3098), and Outcome (958) have fewer non-null entries than the other columns, which means they have more missing data.
- Memory usage: the DataFrame uses approximately 1.2 MB of memory.

The dtypes attribute provides a more straightforward way to see the data types of each column.

In [22]:
print(df.dtypes)

Id                  float64
Age                 float64
Job                  object
Marital              object
Education            object
Default             float64
Balance             float64
HHInsurance         float64
CarLoan             float64
Communication        object
LastContactDay      float64
LastContactMonth     object
NoOfContacts        float64
DaysPassed          float64
PrevAttempts        float64
Outcome              object
CallStart            object
CallEnd              object
CarInsurance        float64
dtype: object


# Conclusions:
### Data Completeness:

- Columns with fewer than 4000 non-null values need attention. Job, Education, Communication, and Outcome all have missing values beyond the 4000 empty entries shared by every column.
- The Outcome column is especially sparse, with only 958 non-null values. This could affect any analysis that depends on it.

### Potential Data Issues:

- The presence of exactly 4000 non-null values in many columns suggests that there might be two distinct datasets combined into one. Further investigation is needed to understand this structure and handle it appropriately.
- The high number of missing values in some columns might require data imputation or exclusion of these columns from certain analyses. These are two common techniques for handling missing data in pandas:
  - Data imputation fills in missing values with estimates based on the available data. Common methods are mean, median, or mode imputation, forward and backward fill, and interpolation.
  - Data exclusion removes rows or columns with missing values. It is appropriate when the proportion of missing data is small and you prefer to work with complete cases. The options are: drop rows with missing values, drop columns with missing values, or drop rows with missing values in specific columns.
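
The two techniques described above can be sketched on a small made-up frame. This is only an illustration: `toy` and its values are assumptions, not the car insurance data, and median/mode imputation is just one of the possible choices.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in a numeric and a text column
toy = pd.DataFrame({
    "Balance": [100.0, np.nan, 300.0, 500.0],
    "Job": ["admin.", "technician", None, "admin."],
})

# Imputation: fill numeric gaps with the median, text gaps with the mode
imputed = toy.copy()
imputed["Balance"] = imputed["Balance"].fillna(imputed["Balance"].median())
imputed["Job"] = imputed["Job"].fillna(imputed["Job"].mode()[0])

# Exclusion: three common variants
rows_any = toy.dropna()                   # drop rows with any missing value
cols_any = toy.dropna(axis=1)             # drop columns with any missing value
rows_subset = toy.dropna(subset=["Job"])  # drop rows missing a value in "Job" only

print(imputed)
print(len(rows_any), cols_any.shape, len(rows_subset))
```

Note that none of these calls modifies `toy`; each returns a new object, which makes it easy to compare strategies side by side.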

### Data Types:

- The columns CallStart and CallEnd are of type object, but their values are times of day stored as strings. Converting these to datetime objects could be beneficial for time-based analyses.
- Other object type columns likely represent categorical data (e.g., Job, Marital, Education, Communication, LastContactMonth, Outcome). Encoding these columns properly (e.g., one-hot encoding) would be important for machine learning tasks. Categorical data in pandas is a data type for variables that take on a limited, fixed number of possible values. These values label categories and are useful in statistical modeling.
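
As a rough sketch of what categorical storage and one-hot encoding look like, the example below uses a made-up `Marital` column (the values are assumptions for illustration). It relies on `astype("category")` and `pd.get_dummies`, which are standard pandas calls.

```python
import pandas as pd

# Hypothetical values, not taken from the dataset
toy = pd.DataFrame({"Marital": ["single", "married", "single", "divorced"]})

# Store the column with pandas' categorical dtype
toy["Marital"] = toy["Marital"].astype("category")
categories = list(toy["Marital"].cat.categories)

# One-hot encode: one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=["Marital"], dtype=int)

print(categories)
print(encoded)
```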

### Next Steps:

- Handling Missing Data: Decide on a strategy for handling missing values. This could include imputation, removal, or using algorithms that can handle missing values natively.
- Data Transformation: Convert columns like CallStart and CallEnd to datetime objects for more accurate time-based analysis.
- Categorical Encoding: Convert categorical columns into a suitable format for analysis, such as using one-hot encoding for machine learning models.
- Investigate Data Source: Understand why there are 4000 non-null values in most columns but 8000 entries in total. This might involve investigating the data collection process or checking for possible data entry errors.
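
One way to start investigating the 4000/8000 split is to count the rows in which every column is missing. The rows displayed in step 2 alternate between empty and filled, so the sketch below uses a small made-up frame with the same pattern; `toy` and its values are assumptions for illustration, not the real data.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the pattern seen in the display: every other row is empty
toy = pd.DataFrame({
    "Id": [np.nan, 1.0, np.nan, 2.0],
    "Job": [None, "management", None, None],
})

fully_empty = toy.isna().all(axis=1)   # True where every column is missing
n_empty = int(fully_empty.sum())

# Do the empty rows sit exactly on positions 0, 2, 4, ...?
on_even_positions = bool(fully_empty.iloc[::2].all())

print(n_empty, on_even_positions)
```

If the same check on the real DataFrame reported 4000 fully empty rows, that would point to blank lines in the file rather than to two merged datasets.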

### 4. Check how many missing values each column has

To check for missing values, combine isnull() and sum() to get the count of missing values for each column:

In [23]:
missing_values = df.isnull().sum()

In [24]:
print(missing_values)

Id                  4000
Age                 4000
Job                 4019
Marital             4000
Education           4169
Default             4000
Balance             4000
HHInsurance         4000
CarLoan             4000
Communication       4902
LastContactDay      4000
LastContactMonth    4000
NoOfContacts        4000
DaysPassed          4000
PrevAttempts        4000
Outcome             7042
CallStart           4000
CallEnd             4000
CarInsurance        4000
dtype: int64
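
Raw counts are easier to compare when shown next to the share of rows they represent. The following is a minimal sketch on made-up data (`toy` is an assumption); since `isna()` yields booleans, `.mean()` gives the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration
toy = pd.DataFrame({
    "Age": [32.0, np.nan, 45.0, np.nan],
    "Outcome": [None, None, "failure", None],
})

missing_count = toy.isna().sum()
missing_share = toy.isna().mean() * 100   # percentage of rows missing per column

summary = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(summary)
```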


### 5. Drop the rows containing missing values

To drop the rows containing missing values, use the dropna() method provided by pandas. With inplace=True, as in the cell below, it modifies df directly; without it, dropna() returns a new DataFrame and leaves df unchanged. By default, every row containing at least one missing value is removed. You don't need to import numpy for this: dropna() is part of pandas.

However, if you perform other operations that involve numerical computations or numpy-specific functions, you might need to import numpy.

In [25]:

df.dropna(inplace=True) 

After running the code above, I noticed that the NaN values were still there. After running the code below, all my results changed. Maybe I should have removed all the NaN values from the start!

![image.png](attachment:image.png)

By itself, dropna() deletes every row that contains at least one NaN value. From this file, however, I understand that I need to use how='all' to delete the rows in which all values are missing.
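
The difference between the two modes can be seen on a small made-up frame (`toy` is an assumption for illustration). Note that once the default `dropna()` has run, a later `dropna(how='all')` has nothing left to remove; the 907 rows that remain later in this notebook are consistent with the default having been applied first.

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 is fully empty, row 2 is partially filled
toy = pd.DataFrame({
    "Id": [1.0, np.nan, 2.0],
    "Outcome": ["failure", None, None],
})

only_complete = toy.dropna()            # how="any" is the default
without_empty = toy.dropna(how="all")   # removes only the fully empty rows

print(len(only_complete), len(without_empty))
```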

In [26]:
df.dropna(how='all', inplace=True)

In [27]:
df

Unnamed: 0,Id,Age,Job,Marital,Education,Default,Balance,HHInsurance,CarLoan,Communication,LastContactDay,LastContactMonth,NoOfContacts,DaysPassed,PrevAttempts,Outcome,CallStart,CallEnd,CarInsurance
5,3.0,29.0,management,single,tertiary,0.0,637.0,1.0,0.0,cellular,3.0,jun,1.0,119.0,1.0,failure,16:30:24,16:36:04,1.0
11,6.0,32.0,technician,single,tertiary,0.0,1625.0,0.0,0.0,cellular,22.0,may,1.0,109.0,1.0,failure,14:58:08,15:11:24,1.0
31,16.0,61.0,management,single,tertiary,0.0,2.0,0.0,0.0,cellular,12.0,aug,1.0,114.0,3.0,failure,16:18:48,16:20:59,1.0
33,17.0,34.0,admin.,single,secondary,0.0,69.0,1.0,0.0,telephone,6.0,may,3.0,362.0,4.0,other,11:48:45,11:50:17,0.0
35,18.0,46.0,management,married,tertiary,0.0,7331.0,0.0,0.0,cellular,11.0,sep,4.0,95.0,2.0,other,11:23:26,11:34:24,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7975,3988.0,27.0,admin.,married,tertiary,0.0,2855.0,0.0,0.0,cellular,25.0,jan,1.0,301.0,3.0,failure,09:55:44,09:59:44,1.0
7981,3991.0,27.0,technician,single,secondary,0.0,126.0,1.0,1.0,cellular,5.0,feb,2.0,216.0,4.0,failure,13:30:49,13:33:16,0.0
7985,3993.0,34.0,technician,married,secondary,0.0,0.0,1.0,0.0,cellular,5.0,aug,2.0,2.0,3.0,success,10:51:19,10:55:10,1.0
7991,3996.0,28.0,technician,single,tertiary,0.0,0.0,1.0,0.0,cellular,25.0,may,1.0,40.0,2.0,failure,17:46:28,17:50:57,1.0


### 6. For the numerical values, obtain the mean (average) value

First, I want to find which columns contain numerical values without looking for them manually. The following snippet returns a list of the column names in the DataFrame that have a numeric dtype:

In [28]:
# Select columns with numerical values only
import numpy as np

numerical_columns = df.select_dtypes(include=[np.number]).columns.tolist()

### Explanation:
- import pandas as pd: imports the pandas library (already done above).
- import numpy as np: imports the numpy library, which provides np.number to specify numeric types.
- df.select_dtypes(include=[np.number]): selects the columns with a numeric dtype (both integers and floats).
- .columns.tolist(): converts the column names to a list for easier use.


In [29]:
print("List of all columns that contain only numerical Data:", numerical_columns)

List of all columns that contain only numerical Data: ['Id', 'Age', 'Default', 'Balance', 'HHInsurance', 'CarLoan', 'LastContactDay', 'NoOfContacts', 'DaysPassed', 'PrevAttempts', 'CarInsurance']


In [30]:
numerical_columns = ['Id', 'Age', 'Default', 'Balance', 'HHInsurance', 'CarLoan', 'LastContactDay', 'NoOfContacts', 'DaysPassed', 'PrevAttempts', 'CarInsurance']

Calculate the mean of each numerical column:

In [31]:
# df[numerical_columns] selects only the numerical columns; .mean() computes the mean of each one
mean_values = df[numerical_columns].mean()

In [32]:
mean_values

Id                2004.713341
Age                 41.261301
Default              0.004410
Balance           1741.250276
HHInsurance          0.509372
CarLoan              0.098126
LastContactDay      14.407938
NoOfContacts         1.929438
DaysPassed         204.815877
PrevAttempts         2.987872
CarInsurance         0.579934
dtype: float64

### Conclusion:
To find the mean of each numerical column in the DataFrame with pandas:

1. Identify the numerical columns: select_dtypes(include=[np.number]) returns them without manual inspection.

2. Calculate the mean of each column: calling .mean() on that selection computes all the means at once.
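
A shorter route, sketched here on made-up data, is to let pandas skip the non-numeric columns itself with `mean(numeric_only=True)`. The frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical values: two numeric columns and one text column
toy = pd.DataFrame({
    "Age": [30.0, 40.0, 50.0],
    "Balance": [100.0, 200.0, 600.0],
    "Job": ["admin.", "technician", "management"],
})

# Non-numeric columns such as "Job" are ignored
means = toy.mean(numeric_only=True)
print(means)
```

When reading such results, keep in mind that the mean of an identifier column like Id carries no meaning, and that for columns that appear to hold only 0 and 1 (for example CarInsurance) the mean is the share of rows with value 1.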

### 7. Observe the Job column. How many categories can you find? What is the most frequent?

Counting Unique Values: df['Job'].value_counts() computes the number of occurrences of each unique value in the 'Job' column and returns it as a pandas Series where the index is the unique job names and the values are their counts.

In [33]:
job_counts = df["Job"].value_counts()

In [59]:
print("Number of each job category in column :",job_counts)

Number of each job category in column : Job
management       231
technician       152
blue-collar      131
admin.           114
services          69
retired           64
unemployed        48
student           40
self-employed     27
entrepreneur      16
housemaid         15
Name: count, dtype: int64


Number of unique values: len(job_counts) gives the number of unique job types.

In [35]:
unique_jobs = len(job_counts)

In [60]:
print("Number of different Job categories are :",unique_jobs)

Number of different Job categories are : 11
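
The counts above show that management is the most frequent category, with 231 rows. Both answers can also be read off programmatically; the sketch below uses a made-up Series (`jobs` is an assumption) together with `nunique()` and `idxmax()`.

```python
import pandas as pd

# Hypothetical values, including one missing entry
jobs = pd.Series(["management", "technician", "management", "admin.", None])

n_categories = jobs.nunique()                  # missing values are not counted
most_frequent = jobs.value_counts().idxmax()   # label with the highest count

print(n_categories, most_frequent)
```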


### 8. Create a new column with the duration of each call in seconds

Thinking...

Let's display the columns that hold the start and end time of each call.

Data preparation: the DataFrame (df) already has the columns CallStart and CallEnd, which contain time strings.


In [37]:
selected_columns = df[["CallStart", "CallEnd"]]

In [38]:
selected_columns

Unnamed: 0,CallStart,CallEnd
5,16:30:24,16:36:04
11,14:58:08,15:11:24
31,16:18:48,16:20:59
33,11:48:45,11:50:17
35,11:23:26,11:34:24
...,...,...
7975,09:55:44,09:59:44
7981,13:30:49,13:33:16
7985,10:51:19,10:55:10
7991,17:46:28,17:50:57


### Before any calculations, we need to check the data type of the CallStart and CallEnd columns to see whether we can process them further

Convert to datetime: use pd.to_datetime() to convert the CallStart and CallEnd columns to pandas datetime objects. This ensures that pandas recognizes these columns as datetime types, which allows datetime arithmetic.


Based on the format of this data (CallStart and CallEnd columns with times like "16:30:24"), it appears these columns contain time values rather than datetime objects. In Pandas, when you read in data with times in this format, they are typically treated as strings (object dtype) unless explicitly converted to datetime objects.

To check if these columns are already converted to datetime objects or if they are still strings, you can inspect the data types of the DataFrame columns using df.dtypes. Here's how you can do it:

In [39]:
print(df.dtypes)

Id                  float64
Age                 float64
Job                  object
Marital              object
Education            object
Default             float64
Balance             float64
HHInsurance         float64
CarLoan             float64
Communication        object
LastContactDay      float64
LastContactMonth     object
NoOfContacts        float64
DaysPassed          float64
PrevAttempts        float64
Outcome              object
CallStart            object
CallEnd              object
CarInsurance        float64
dtype: object


CallStart and CallEnd are indeed stored as strings (object dtype) that hold times of day. We want to convert them to pandas datetime objects for easier manipulation, such as calculating the duration between them.

### How to convert the object data type to datetime objects for easier manipulation:

In [40]:
# Convert CallStart and CallEnd to datetime objects.
# The strings hold no date, so pandas fills in the default date 1900-01-01.
df["CallStart"] = pd.to_datetime(df["CallStart"], format="%H:%M:%S")
df["CallEnd"] = pd.to_datetime(df["CallEnd"], format="%H:%M:%S")

In [41]:
#Check datatypes after conversion
print(df.dtypes)

Id                         float64
Age                        float64
Job                         object
Marital                     object
Education                   object
Default                    float64
Balance                    float64
HHInsurance                float64
CarLoan                    float64
Communication               object
LastContactDay             float64
LastContactMonth            object
NoOfContacts               float64
DaysPassed                 float64
PrevAttempts               float64
Outcome                     object
CallStart           datetime64[ns]
CallEnd             datetime64[ns]
CarInsurance               float64
dtype: object


### Now, we need to subtract CallStart from CallEnd

and store the result in a new column named "Duration in Seconds".

In [62]:
df["Duration in Seconds"] = (df["CallEnd"] - df["CallStart"]).dt.total_seconds()

In [61]:
df

Unnamed: 0,Id,Age,Job,Marital,Education,Default,Balance,HHInsurance,CarLoan,Communication,LastContactDay,LastContactMonth,NoOfContacts,DaysPassed,PrevAttempts,Outcome,CallStart,CallEnd,CarInsurance,Duration in Seconds
5,3.0,29.0,management,single,tertiary,0.0,637.0,1.0,0.0,cellular,3.0,jun,1.0,119.0,1.0,failure,1900-01-01 16:30:24,1900-01-01 16:36:04,1.0,340.0
11,6.0,32.0,technician,single,tertiary,0.0,1625.0,0.0,0.0,cellular,22.0,may,1.0,109.0,1.0,failure,1900-01-01 14:58:08,1900-01-01 15:11:24,1.0,796.0
31,16.0,61.0,management,single,tertiary,0.0,2.0,0.0,0.0,cellular,12.0,aug,1.0,114.0,3.0,failure,1900-01-01 16:18:48,1900-01-01 16:20:59,1.0,131.0
33,17.0,34.0,admin.,single,secondary,0.0,69.0,1.0,0.0,telephone,6.0,may,3.0,362.0,4.0,other,1900-01-01 11:48:45,1900-01-01 11:50:17,0.0,92.0
35,18.0,46.0,management,married,tertiary,0.0,7331.0,0.0,0.0,cellular,11.0,sep,4.0,95.0,2.0,other,1900-01-01 11:23:26,1900-01-01 11:34:24,1.0,658.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7975,3988.0,27.0,admin.,married,tertiary,0.0,2855.0,0.0,0.0,cellular,25.0,jan,1.0,301.0,3.0,failure,1900-01-01 09:55:44,1900-01-01 09:59:44,1.0,240.0
7981,3991.0,27.0,technician,single,secondary,0.0,126.0,1.0,1.0,cellular,5.0,feb,2.0,216.0,4.0,failure,1900-01-01 13:30:49,1900-01-01 13:33:16,0.0,147.0
7985,3993.0,34.0,technician,married,secondary,0.0,0.0,1.0,0.0,cellular,5.0,aug,2.0,2.0,3.0,success,1900-01-01 10:51:19,1900-01-01 10:55:10,1.0,231.0
7991,3996.0,28.0,technician,single,tertiary,0.0,0.0,1.0,0.0,cellular,25.0,may,1.0,40.0,2.0,failure,1900-01-01 17:46:28,1900-01-01 17:50:57,1.0,269.0


### Show the columns CallStart, CallEnd and Duration in Seconds side by side

In [44]:
df[["CallStart", "CallEnd", "Duration in Seconds"]]

Unnamed: 0,CallStart,CallEnd,Duration in Seconds
5,1900-01-01 16:30:24,1900-01-01 16:36:04,340.0
11,1900-01-01 14:58:08,1900-01-01 15:11:24,796.0
31,1900-01-01 16:18:48,1900-01-01 16:20:59,131.0
33,1900-01-01 11:48:45,1900-01-01 11:50:17,92.0
35,1900-01-01 11:23:26,1900-01-01 11:34:24,658.0
...,...,...,...
7975,1900-01-01 09:55:44,1900-01-01 09:59:44,240.0
7981,1900-01-01 13:30:49,1900-01-01 13:33:16,147.0
7985,1900-01-01 10:51:19,1900-01-01 10:55:10,231.0
7991,1900-01-01 17:46:28,1900-01-01 17:50:57,269.0
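
An alternative to `pd.to_datetime` that avoids the placeholder date 1900-01-01 is `pd.to_timedelta`, which treats each string as an offset from midnight. This is a sketch of a different technique, not the approach used above. The frame `calls` is made up, and the handling of calls that cross midnight is an assumption: nothing in the dataset confirms that such calls exist.

```python
import pandas as pd

# Hypothetical calls; the second one crosses midnight
calls = pd.DataFrame({
    "CallStart": ["16:30:24", "23:59:00"],
    "CallEnd": ["16:36:04", "00:01:30"],
})

start = pd.to_timedelta(calls["CallStart"])
end = pd.to_timedelta(calls["CallEnd"])
duration = (end - start).dt.total_seconds()

# Assumption: a negative duration means the call ended on the next day
duration = duration.where(duration >= 0, duration + 24 * 60 * 60)

calls["Duration in Seconds"] = duration
print(calls)
```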


### 9. What is the average duration of each call?

To calculate the average call duration, we need to add up all the values in the "Duration in Seconds" column and then divide the total by the number of rows (which appears to be 907 here). I am wondering whether I can add the values directly, or whether I first need to convert "Duration in Seconds" to another data type.

Thinking...

Maybe check which data type "Duration in Seconds" has. I only want to see the type of this particular column, not all of them. How can I do that?

In [45]:
duration_column_datatype = df["Duration in Seconds"].dtypes

In [46]:
duration_column_datatype

dtype('float64')

It appears that the data type here is float64. Can I add the floating-point numbers of a column? I guess I can.

Yes, you can sum the values of a float64 column in a pandas DataFrame.

In [47]:
total_sum = df["Duration in Seconds"].sum()

In [48]:
total_sum

304524.0

It looks like we have found the total call duration in seconds. Now we need to make sure we have the correct number of rows, that is, registered calls, so that we can divide total_sum by that number. How can I find the total number of rows of a column, and is this valid? (See the picture below.)

![image.png](attachment:image.png)

In [49]:
total_rows = len(df['Duration in Seconds'])
print("Total number of rows in this column is :",total_rows)

Total number of rows in this column is : 907


The number of rows is confirmed.

### Calculating the average duration of calls in seconds:

In [50]:
average_duration = total_sum/total_rows
print("The average duration of each call in seconds is :", average_duration)

The average duration of each call in seconds is : 335.74862183020946
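
The same result can be obtained in one step with `Series.mean()`. The sketch below compares both routes on a few made-up durations (`durations` is an assumption for illustration).

```python
import pandas as pd

# Hypothetical durations in seconds
durations = pd.Series([340.0, 796.0, 131.0, 92.0, 658.0])

manual_average = durations.sum() / len(durations)
builtin_average = durations.mean()

print(manual_average, builtin_average)
```

The two agree here because no value is missing. With missing values they would differ: `mean()` skips NaN by default, while `len()` counts every row.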


### 10. For those people contacted during the first half of the year (Jan-June), what is the most common way of communication (telephone, cellular...)?

Thinking...
First, let's find the columns I need to include in order to answer this question.
Columns for analysis: Communication and LastContactMonth.
Thinking...


Shall we analyse the "Communication" column first? That is, count the occurrences of each form of communication and then maybe filter these to Jan-Jun? I'm not sure about the second part yet, but let's attempt the first idea.

In [52]:
df[["LastContactMonth", "Communication"]]

Unnamed: 0,LastContactMonth,Communication
5,jun,cellular
11,may,cellular
31,aug,cellular
33,may,telephone
35,sep,cellular
...,...,...
7975,jan,cellular
7981,feb,cellular
7985,aug,cellular
7991,may,cellular


### Defining the first six months in a list

In [53]:
first_six_months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun']

### Filtering the first 6 months:

In [56]:
filtered_df = df[df['LastContactMonth'].str.lower().isin(first_six_months)]
filtered_df

Unnamed: 0,Id,Age,Job,Marital,Education,Default,Balance,HHInsurance,CarLoan,Communication,LastContactDay,LastContactMonth,NoOfContacts,DaysPassed,PrevAttempts,Outcome,CallStart,CallEnd,CarInsurance,Duration in Seconds
5,3.0,29.0,management,single,tertiary,0.0,637.0,1.0,0.0,cellular,3.0,jun,1.0,119.0,1.0,failure,1900-01-01 16:30:24,1900-01-01 16:36:04,1.0,340.0
11,6.0,32.0,technician,single,tertiary,0.0,1625.0,0.0,0.0,cellular,22.0,may,1.0,109.0,1.0,failure,1900-01-01 14:58:08,1900-01-01 15:11:24,1.0,796.0
33,17.0,34.0,admin.,single,secondary,0.0,69.0,1.0,0.0,telephone,6.0,may,3.0,362.0,4.0,other,1900-01-01 11:48:45,1900-01-01 11:50:17,0.0,92.0
37,19.0,49.0,blue-collar,married,secondary,0.0,2039.0,1.0,0.0,cellular,6.0,may,1.0,169.0,2.0,failure,1900-01-01 12:42:54,1900-01-01 12:50:25,1.0,451.0
49,25.0,60.0,technician,married,secondary,0.0,824.0,1.0,0.0,cellular,9.0,feb,1.0,558.0,7.0,other,1900-01-01 16:30:52,1900-01-01 16:32:59,1.0,127.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7971,3986.0,40.0,technician,married,primary,0.0,644.0,1.0,0.0,cellular,16.0,apr,2.0,336.0,1.0,failure,1900-01-01 10:49:27,1900-01-01 10:51:25,0.0,118.0
7975,3988.0,27.0,admin.,married,tertiary,0.0,2855.0,0.0,0.0,cellular,25.0,jan,1.0,301.0,3.0,failure,1900-01-01 09:55:44,1900-01-01 09:59:44,1.0,240.0
7981,3991.0,27.0,technician,single,secondary,0.0,126.0,1.0,1.0,cellular,5.0,feb,2.0,216.0,4.0,failure,1900-01-01 13:30:49,1900-01-01 13:33:16,0.0,147.0
7991,3996.0,28.0,technician,single,tertiary,0.0,0.0,1.0,0.0,cellular,25.0,may,1.0,40.0,2.0,failure,1900-01-01 17:46:28,1900-01-01 17:50:57,1.0,269.0


### Above we see all the rows with a last contact in the first 6 months. There are now 526 rows instead of 907. Why is that? The remaining rows are contacts from July onwards.

In [58]:
communication_counts = filtered_df['Communication'].value_counts()
print("Communication in the first six months are : ", communication_counts)

Communication in the first six months are :  Communication
cellular     489
telephone     37
Name: count, dtype: int64
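
The counts above answer the question: cellular is the most common way of communication in the first half of the year, with 489 of the 526 rows. A compact version of the same filter-and-count steps is sketched below on made-up data (`toy` and its values are assumptions), using `value_counts(normalize=True)` to get shares instead of counts.

```python
import pandas as pd

# Hypothetical contacts
toy = pd.DataFrame({
    "LastContactMonth": ["jan", "may", "aug", "jun", "feb"],
    "Communication": ["cellular", "telephone", "cellular", "cellular", "cellular"],
})
first_half = ["jan", "feb", "mar", "apr", "may", "jun"]

# Keep only the rows whose month falls in the first half of the year
in_first_half = toy[toy["LastContactMonth"].isin(first_half)]

shares = in_first_half["Communication"].value_counts(normalize=True)
most_common = shares.idxmax()

print(shares)
print(most_common)
```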
