# Pandas - Removing Duplicates
#### Looking at our test data set, rows 11 and 12 appear to be duplicates.

#### To discover duplicates, we can use the duplicated() method.

#### The duplicated() method returns a Boolean value for each row:


In [1]:
# Returns True for every row that duplicates an earlier row, otherwise False
import pandas as pd  

df = pd.read_csv('data.csv')

print(df.duplicated()) 

0      False
1      False
2      False
3      False
4      False
       ...  
130    False
131    False
132    False
133    False
134    False
Length: 135, dtype: bool
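
#### The printed Series is truncated, so a duplicate in the middle is easy to miss. The sketch below uses a small made-up DataFrame (the name `toy` and its values are assumptions, not taken from data.csv) to show how the Boolean result can be counted and used as a filter:

```python
import pandas as pd

# Hypothetical toy data; rows 2 and 3 are identical on purpose.
toy = pd.DataFrame({
    "Duration": [60, 45, 60, 60],
    "Pulse": [110, 117, 100, 100],
})

# Default keep="first": only the later copy is flagged as a duplicate.
flags = toy.duplicated()

# keep=False flags every member of a duplicate group, including the first.
all_copies = toy[toy.duplicated(keep=False)]

print(flags.sum())   # number of rows flagged as duplicates
print(all_copies)    # every row that has at least one identical twin
```

#### Summing the Boolean Series works because True counts as 1 and False as 0.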


# Removing Duplicates
#### To remove duplicates, use the drop_duplicates() method.

In [2]:
df.drop_duplicates(inplace=True)

Remember: inplace=True makes the method modify the original DataFrame instead of returning a new one. The duplicates are removed from df itself, and the method returns None.
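
#### To make the difference concrete, here is a minimal sketch on a made-up DataFrame (`toy` and its values are hypothetical) that compares the default behavior with inplace=True:

```python
import pandas as pd

# Hypothetical toy data; rows 2 and 3 are identical on purpose.
toy = pd.DataFrame({
    "Duration": [60, 45, 60, 60],
    "Pulse": [110, 117, 100, 100],
})

# Without inplace, a new DataFrame is returned and toy is left untouched.
# reset_index(drop=True) renumbers the remaining rows from 0.
deduped = toy.drop_duplicates().reset_index(drop=True)
rows_before = len(toy)

# With inplace=True, toy itself is changed and the return value is None.
result = toy.drop_duplicates(inplace=True)
rows_after = len(toy)

print(rows_before, rows_after, result)
```

#### Assigning the returned DataFrame to a new name is often the safer choice, because the original data stays available for comparison.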

# Pandas - Data Correlations

# Finding Relationships
#### A great aspect of the Pandas module is the corr() method.

#### The corr() method calculates the pairwise correlation between the columns in your data set.

#### The examples on this page use a CSV file called 'data.csv'.

In [3]:
df.corr()

ValueError: could not convert string to float: '24/01/01'

#### This error occurs because corr() can only compute correlations on numeric data. The DataFrame contains a non-numeric column, Date, that holds date strings such as '24/01/01', and these cannot be converted to float. Selecting only the numeric columns before calling corr() avoids the error:

In [5]:
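import numpy as np

# Parse the Date column; values that cannot be parsed become NaT
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Keep only the numeric columns, which leaves Date out
numeric_df = df.select_dtypes(include=[np.number])

correlation_matrix = numeric_df.corr()

print(correlation_matrix)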

          Duration     Pulse  Maxpulse  Calories
Duration  1.000000 -0.155180  0.037522  0.926192
Pulse    -0.155180  1.000000  0.770656  0.033410
Maxpulse  0.037522  0.770656  1.000000  0.233011
Calories  0.926192  0.033410  0.233011  1.000000


# Note: Since pandas 2.0, corr() no longer ignores non-numeric columns by default. Select the numeric columns first, or pass numeric_only=True.
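
#### In recent pandas versions, corr() accepts a numeric_only argument (available from pandas 1.5 as far as I know), which is a shorter alternative to select_dtypes(). A minimal sketch on made-up data (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns.
toy = pd.DataFrame({
    "Date": ["2020/12/01", "2020/12/02", "2020/12/03", "2020/12/04"],
    "Duration": [60, 45, 30, 60],
    "Calories": [400.0, 300.0, 200.0, 400.0],
})

# numeric_only=True tells corr() to skip the non-numeric Date column
# instead of raising a ValueError.
matrix = toy.corr(numeric_only=True)

print(matrix)
```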

# Result Explained
#### The result of the corr() method is a table of numbers that represent how strong the relationship is between each pair of columns.

#### The number varies from -1 to 1.

#### 1 means that there is a 1 to 1 relationship (a perfect correlation), and for this data set, each time a value went up in the first column, the other one went up as well.

#### 0.9 is also a good relationship, and if you increase one value, the other will probably increase as well.

#### -0.9 would be just as good a relationship as 0.9, but if you increase one value, the other will probably go down.

#### 0.2 means NOT a good relationship: if one value goes up, that does not mean the other will.
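
#### The two extremes of the scale can be reproduced with a small made-up example (the column names `x`, `up`, and `down` are hypothetical). One column rises in a straight line with x, the other falls in a straight line:

```python
import pandas as pd

# Hypothetical toy data: "up" rises linearly with x, "down" falls linearly.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["up"] = 2 * toy["x"] + 1
toy["down"] = -3 * toy["x"] + 10

matrix = toy.corr()

# x vs up is a perfect positive correlation, x vs down a perfect negative one.
print(matrix.loc["x", "up"])
print(matrix.loc["x", "down"])
```

#### Real data rarely lands exactly on 1 or -1, so values close to the extremes are what you usually look for.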

#### What is a good correlation? It depends on the use, but I think it is safe to say you need at least 0.6 (or at most -0.6) to call it a good correlation.

#### Perfect Correlation:
#### We can see that "Duration" and "Duration" got the number 1.000000, which makes sense: each column always has a perfect relationship with itself.

#### Good Correlation:
#### "Duration" and "Calories" got a 0.926192 correlation, which is a very good correlation, and we can predict that the longer you work out, the more calories you burn, and the other way around: if you burned a lot of calories, you probably had a long workout.

#### Bad Correlation:
#### "Duration" and "Maxpulse" got a 0.037522 correlation, which is a very bad correlation, meaning that we cannot predict the max pulse by just looking at the duration of the workout, and vice versa.
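
#### Reading a correlation table by eye gets harder as the number of columns grows. The sketch below rebuilds the matrix printed earlier by hand and lists every pair of different columns whose absolute correlation reaches the 0.6 rule of thumb (the names `threshold` and `strong_pairs` are assumptions made for this example):

```python
import pandas as pd

# Correlation values copied from the printed matrix above.
cols = ["Duration", "Pulse", "Maxpulse", "Calories"]
corr = pd.DataFrame(
    [
        [1.000000, -0.155180, 0.037522, 0.926192],
        [-0.155180, 1.000000, 0.770656, 0.033410],
        [0.037522, 0.770656, 1.000000, 0.233011],
        [0.926192, 0.033410, 0.233011, 1.000000],
    ],
    index=cols,
    columns=cols,
)

threshold = 0.6  # rule-of-thumb cut-off from the text, not a pandas setting

# Visit each pair once (upper triangle only) and skip the diagonal.
strong_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) >= threshold
]

for a, b, r in strong_pairs:
    print(f"{a} / {b}: {r:.3f}")
```

#### Using abs() means strong negative correlations are reported too, in line with the "0.6 (or -0.6)" rule.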

