# Duplicate Values Analysis

## Definition
1. A duplicate is a value or record that is repeated or copied within the data.


## Reasons
1. Human Error
   - Entering the same record multiple times
2. System Glitches 
3. Merging the Data 

Duplicate values usually arise for a few common reasons. Each one is explained below with sample data:

### 1. Human Error
Human errors can cause duplicate entries, often when data is manually entered into a system.

**Example:**
Suppose an employee is entering sales records into a database. They might accidentally enter the same sale twice.

**Sample Data:**
| Sale_ID | Product | Amount | Date       | Note                         |
|---------|---------|--------|------------|------------------------------|
| 001     | Widget  | $100   | 2024-05-01 |                              |
| 002     | Gadget  | $150   | 2024-05-02 |                              |
| 003     | Widget  | $100   | 2024-05-01 | Duplicate due to human error |
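
In this sample the repeated sale received a new `Sale_ID`, so a full-row comparison would not flag it. The sketch below shows one way to catch it in pandas. It assumes the table is loaded into a DataFrame called `sales` (a hypothetical name) and that `Product`, `Amount`, and `Date` together identify a sale.

```python
import pandas as pd

# Hypothetical DataFrame mirroring the sample table above.
sales = pd.DataFrame({
    "Sale_ID": ["001", "002", "003"],
    "Product": ["Widget", "Gadget", "Widget"],
    "Amount": [100, 150, 100],
    "Date": ["2024-05-01", "2024-05-02", "2024-05-01"],
})

# A full-row comparison misses the repeat because Sale_ID differs.
full_row_dupes = sales.duplicated().sum()

# Compare only the business columns (an assumption about what makes a sale unique).
content_dupes = sales.duplicated(subset=["Product", "Amount", "Date"])
print(sales[content_dupes])
```

Two genuine sales of the same product on the same day would also be flagged this way, so the result is better treated as a list of candidates to review than as confirmed duplicates.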

### 2. System Glitches
System glitches or bugs can cause the same data to be recorded multiple times.

**Example:**
An automated system might fail to confirm that a record has been successfully saved, leading it to try saving the same record multiple times.

**Sample Data:**
| Transaction_ID | User   | Amount | Timestamp           | Note                           |
|----------------|--------|--------|---------------------|--------------------------------|
| A123           | Alice  | $200   | 2024-05-03 10:00:00 |                                |
| A124           | Bob    | $300   | 2024-05-03 11:00:00 |                                |
| A123           | Alice  | $200   | 2024-05-03 10:00:00 | Duplicate due to system glitch |
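
One common guard against this kind of retry is to make the save step ignore a `Transaction_ID` it has already stored. The sketch below illustrates the idea with a plain dictionary standing in for the database. The names `saved` and `save_once` are hypothetical, and a real system would more likely rely on a unique constraint in the database itself.

```python
# Hypothetical in-memory store standing in for a real database table.
saved = {}

def save_once(record):
    """Store a record only if its Transaction_ID has not been saved yet."""
    key = record["Transaction_ID"]
    if key in saved:
        return False  # retry of an already-saved record: ignore it
    saved[key] = record
    return True

attempts = [
    {"Transaction_ID": "A123", "User": "Alice", "Amount": 200},
    {"Transaction_ID": "A124", "User": "Bob", "Amount": 300},
    {"Transaction_ID": "A123", "User": "Alice", "Amount": 200},  # retried save
]
results = [save_once(r) for r in attempts]
print(results, len(saved))
```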

### 3. Merging the Data
When merging data from different sources, duplicates can occur if there is no proper mechanism to identify and eliminate them.

**Example:**
Combining customer data from two databases without checking a unique identifier such as `Customer_ID` leaves any customer present in both sources in the result twice.

**Sample Data:**
**Database 1:**
| Customer_ID | Name   | Email             |
|-------------|--------|-------------------|
| C001        | John   | john@example.com  |
| C002        | Jane   | jane@example.com  |

**Database 2:**
| Customer_ID | Name   | Email             |
|-------------|--------|-------------------|
| C001        | John   | john@example.com  |
| C003        | Alice  | alice@example.com |

**Merged Data:**
| Customer_ID | Name   | Email             | Note                                           |
|-------------|--------|-------------------|------------------------------------------------|
| C001        | John   | john@example.com  |                                                |
| C002        | Jane   | jane@example.com  |                                                |
| C001        | John   | john@example.com  | Duplicate due to merging without proper checks |
| C003        | Alice  | alice@example.com |                                                |
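
A minimal pandas sketch of a merge that checks for duplicates is shown below. It assumes `Customer_ID` reliably identifies a customer in both sources. The names `db1`, `db2`, and `merged` are illustrative.

```python
import pandas as pd

db1 = pd.DataFrame({
    "Customer_ID": ["C001", "C002"],
    "Name": ["John", "Jane"],
    "Email": ["john@example.com", "jane@example.com"],
})
db2 = pd.DataFrame({
    "Customer_ID": ["C001", "C003"],
    "Name": ["John", "Alice"],
    "Email": ["john@example.com", "alice@example.com"],
})

# Stack the two sources, then keep the first row seen for each Customer_ID.
merged = (
    pd.concat([db1, db2], ignore_index=True)
      .drop_duplicates(subset="Customer_ID", keep="first")
      .reset_index(drop=True)
)
print(merged)
```

If the two sources could disagree about a customer's details, keeping the first row silently discards the other version, so such conflicts may need a separate review step.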

### Summary
- **Human Error**: Occurs when individuals input the same data more than once.
- **System Glitches**: Happen due to bugs or errors in automated systems causing repeated entries.
- **Merging the Data**: Arises when combining datasets without proper de-duplication mechanisms.

By understanding these reasons, steps can be taken to mitigate duplicates, such as implementing validation checks, improving system reliability, and using proper data merging techniques.

## Impact 
1. Time Complexity
2. Space Complexity

### Impact of Duplicates on Time and Space Complexity

Duplicates affect both time complexity and space complexity. Each impact is explained below with sample data.

### 1. Time Complexity
Time complexity refers to the amount of time it takes to run an algorithm as a function of the length of the input.

**Impact of Duplicates:**
- **Searching and Processing**: When data contains duplicates, the time taken to search, sort, or process data increases because each duplicate needs to be handled.
- **De-duplication**: Identifying and removing duplicates itself adds overhead to the process, increasing the time complexity.

**Example:**
Suppose we have a list of items and we want to find if a particular item exists in the list.

**Sample Data:**
| Item_ID | Item_Name | Note      |
|---------|-----------|-----------|
| 001     | Apple     |           |
| 002     | Banana    |           |
| 003     | Apple     | Duplicate |
| 004     | Orange    |           |

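**Searching Without Duplicates:**
If each item appears once and the data is held in a hash-based structure (a set, or a dictionary keyed on the item), looking up "Apple" takes \( O(1) \) time on average. In a plain unsorted list the lookup is \( O(n) \) in the worst case, whether or not duplicates exist.

**Searching With Duplicates:**
With duplicates, a search can no longer stop at the first match when every occurrence is needed, so the whole list must be scanned in \( O(n) \) time. Duplicates also inflate \( n \) itself, so every scan does more work.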
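
The difference between checking that an item exists and collecting every occurrence can be sketched in plain Python. The variable names are illustrative only.

```python
items = ["Apple", "Banana", "Apple", "Orange"]

# Linear scan: visits every element, so all occurrences are found in O(n).
positions = [i for i, name in enumerate(items) if name == "Apple"]

# Set membership: average-case O(1) lookup, but it only answers "is it present?".
unique_items = set(items)
found = "Apple" in unique_items

print(positions, found)
```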

**De-duplication Process:**
To remove duplicates:
1. Traverse the list: \( O(n) \)
2. Use a set to keep track of unique items: \( O(1) \) on average per insertion

Overall de-duplication time complexity: \( O(n) \) on average
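
The two steps above can be sketched as a single pass that keeps the first occurrence of each item. The function name `deduplicate` is hypothetical.

```python
def deduplicate(values):
    """Return values with repeats removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for value in values:          # one pass over the input: O(n)
        if value not in seen:     # average O(1) membership test
            seen.add(value)
            unique.append(value)
    return unique

result = deduplicate(["Apple", "Banana", "Apple", "Orange"])
print(result)
```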

### 2. Space Complexity
Space complexity refers to the amount of memory an algorithm needs to run as a function of the length of the input.

**Impact of Duplicates:**
- **Storage**: Storing duplicates takes up additional space, leading to higher memory usage.
- **De-duplication**: The process of de-duplication may require additional data structures, like sets or hash tables, which also consume memory.

**Example:**
Consider a database where customer records are stored.

**Sample Data:**
| Customer_ID | Name   | Email             | Note      |
|-------------|--------|-------------------|-----------|
| 001         | John   | john@example.com  |           |
| 002         | Jane   | jane@example.com  |           |
| 001         | John   | john@example.com  | Duplicate |

**Storage Without Duplicates:**
Storing unique records:
- Space complexity is \( O(n) \), where \( n \) is the number of unique records

**Storage With Duplicates:**
Storing duplicates increases the space required:
- Space complexity becomes \( O(n + m) \), where \( m \) is the number of duplicates

**De-duplication Process:**
To remove duplicates:
1. Use a set or hash table to track seen items
2. These structures require additional space \( O(n) \)
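
As a rough illustration of the storage cost, the sketch below compares the memory pandas reports for the sample records with and without the duplicate row. The name `customers` is illustrative, and the exact byte counts depend on the pandas version and platform, so only the comparison matters.

```python
import pandas as pd

customers = pd.DataFrame({
    "Customer_ID": ["001", "002", "001"],
    "Name": ["John", "Jane", "John"],
    "Email": ["john@example.com", "jane@example.com", "john@example.com"],
})

# deep=True also counts the memory held by the string values themselves.
before = customers.memory_usage(index=False, deep=True).sum()
after = customers.drop_duplicates().memory_usage(index=False, deep=True).sum()
print(before, after)
```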

### Summary
- **Time Complexity**: Duplicates increase the time needed for searching, sorting, and processing data. De-duplication processes add further time complexity.
- **Space Complexity**: Duplicates consume additional storage space. De-duplication requires extra memory for auxiliary data structures like sets or hash tables.

By understanding these impacts, we can design algorithms and systems that better handle duplicates, optimizing both time and space efficiency.

## Identification 

### df.duplicated()
1. Rows - duplicate rows, flagged by `df.duplicated()`
2. Columns - zero-variance columns (every value is identical)

# Duplicates 

In [1]:
import pandas as pd 
import numpy as np 

In [14]:
df = pd.DataFrame({'Name': ['Rajesh', 'Ramesh', 'Suresh', 'Ramesh', 'Ramesh', 'Naresh'],
                   'Salary': [23000, 40000, 24000, 40000, 40000, 34000],
                   'Company': ['IBM', 'IBM', 'IBM', 'IBM', 'IBM', 'IBM']})
df

|   | Name   | Salary | Company |
|---|--------|--------|---------|
| 0 | Rajesh | 23000  | IBM     |
| 1 | Ramesh | 40000  | IBM     |
| 2 | Suresh | 24000  | IBM     |
| 3 | Ramesh | 40000  | IBM     |
| 4 | Ramesh | 40000  | IBM     |
| 5 | Naresh | 34000  | IBM     |


In [15]:
# Identification of duplicates

In [16]:
df.duplicated().sum()

2
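
The count of 2 covers only the repeats after the first occurrence. To see how often each distinct row appears, one option is `DataFrame.value_counts()`, sketched below on the same data. The frame is rebuilt here so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rajesh", "Ramesh", "Suresh", "Ramesh", "Ramesh", "Naresh"],
    "Salary": [23000, 40000, 24000, 40000, 40000, 34000],
    "Company": ["IBM"] * 6,
})

# Default keep='first': a row is True only if an identical row appeared earlier.
repeat_mask = df.duplicated()

# Count how many times each distinct row occurs, most frequent first.
counts = df.value_counts()
print(counts)
```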

## Treatment of Duplicates 

1. Drop the duplicates
2. Move the duplicates into a separate DataFrame

In [17]:
# Collect the flagged duplicate rows in a separate DataFrame
x = df[df.duplicated()]
x

|   | Name   | Salary | Company |
|---|--------|--------|---------|
| 3 | Ramesh | 40000  | IBM     |
| 4 | Ramesh | 40000  | IBM     |


In [18]:
# Save the duplicates to a separate CSV file
x.to_csv('Duplicated.csv')

In [19]:
# Drop the duplicates in place, keeping the first occurrence of each row
df.drop_duplicates(inplace=True)

In [20]:
df

|   | Name   | Salary | Company |
|---|--------|--------|---------|
| 0 | Rajesh | 23000  | IBM     |
| 1 | Ramesh | 40000  | IBM     |
| 2 | Suresh | 24000  | IBM     |
| 5 | Naresh | 34000  | IBM     |


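### Keep

The `keep` parameter of `drop_duplicates` selects which occurrence of each duplicated row is retained: `'first'` (the default), `'last'`, or `False` to drop every occurrence.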

In [21]:
df = pd.DataFrame({'Name': ['Rajesh', 'Ramesh', 'Suresh', 'Ramesh', 'Ramesh', 'Naresh'],
                   'Salary': [23000, 40000, 24000, 40000, 40000, 34000],
                   'Company': ['IBM', 'IBM', 'IBM', 'IBM', 'IBM', 'IBM']})
df

|   | Name   | Salary | Company |
|---|--------|--------|---------|
| 0 | Rajesh | 23000  | IBM     |
| 1 | Ramesh | 40000  | IBM     |
| 2 | Suresh | 24000  | IBM     |
| 3 | Ramesh | 40000  | IBM     |
| 4 | Ramesh | 40000  | IBM     |
| 5 | Naresh | 34000  | IBM     |


In [22]:
df.drop_duplicates(keep='first')

|   | Name   | Salary | Company |
|---|--------|--------|---------|
| 0 | Rajesh | 23000  | IBM     |
| 1 | Ramesh | 40000  | IBM     |
| 2 | Suresh | 24000  | IBM     |
| 5 | Naresh | 34000  | IBM     |


In [24]:
df.drop_duplicates(keep='last')

|   | Name   | Salary | Company |
|---|--------|--------|---------|
| 0 | Rajesh | 23000  | IBM     |
| 2 | Suresh | 24000  | IBM     |
| 4 | Ramesh | 40000  | IBM     |
| 5 | Naresh | 34000  | IBM     |
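
`keep` also accepts `False`, which removes every member of a duplicate group instead of retaining one. A short sketch on the same data follows. The frame is rebuilt so the block is self-contained, and `no_repeats` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rajesh", "Ramesh", "Suresh", "Ramesh", "Ramesh", "Naresh"],
    "Salary": [23000, 40000, 24000, 40000, 40000, 34000],
    "Company": ["IBM"] * 6,
})

# keep=False drops all three 'Ramesh' rows, leaving only rows that never repeat.
no_repeats = df.drop_duplicates(keep=False)
print(no_repeats)
```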


### Zero Variance Features 

In [30]:
df = pd.DataFrame({'Name': ['Rajesh', 'Ramesh', 'Suresh', 'Ramesh', 'Ramesh', 'Naresh'],
                   'Salary': [23000, 23000, 23000, 23000, 23000, 23000],
                   'Company': ['IBM', 'IBM', 'IBM', 'IBM', 'IBM', 'IBM']})
df

|   | Name   | Salary | Company |
|---|--------|--------|---------|
| 0 | Rajesh | 23000  | IBM     |
| 1 | Ramesh | 23000  | IBM     |
| 2 | Suresh | 23000  | IBM     |
| 3 | Ramesh | 23000  | IBM     |
| 4 | Ramesh | 23000  | IBM     |
| 5 | Naresh | 23000  | IBM     |


In [31]:
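# Numerical column: a variance of 0 means every value is identical
df['Salary'].var()

0.0

In [32]:
# Categorical column: list the distinct values
df['Company'].unique()

array(['IBM'], dtype=object)

In [33]:
# Categorical column: count the distinct values
df['Company'].nunique()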

1

In [35]:
# Drop the zero-variance Salary column
df.drop('Salary', axis=1, inplace=True)

In [36]:
df

|   | Name   | Company |
|---|--------|---------|
| 0 | Rajesh | IBM     |
| 1 | Ramesh | IBM     |
| 2 | Suresh | IBM     |
| 3 | Ramesh | IBM     |
| 4 | Ramesh | IBM     |
| 5 | Naresh | IBM     |
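
The per-column checks above can be generalised with `nunique()`, which works for numerical and categorical columns alike. The sketch below is one possible approach, not the notebook's own method. Unlike the cells above, which remove only `Salary`, it drops every constant column, so `Company` goes too.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rajesh", "Ramesh", "Suresh", "Ramesh", "Ramesh", "Naresh"],
    "Salary": [23000] * 6,
    "Company": ["IBM"] * 6,
})

# A column with at most one distinct value carries no information.
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

reduced = df.drop(columns=constant_cols)
print(constant_cols)
print(reduced)
```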


# Type Casting 

In [37]:
#### Changing one data type to another data type

In [39]:
df = pd.read_csv(r"C:\Users\91771\Desktop\Innomatic\EDA\Pandas\Datasets\titanic (1).csv")
df

|     | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived |
|-----|-------------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|----------|
| 0   | 1   | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 |  | S | 0 |
| 1   | 2   | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
| 2   | 3   | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 |  | S | 1 |
| 3   | 4   | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 |
| 4   | 5   | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 |  | S | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 895 | 896 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | 0 |
| 896 | 897 | 3 | Alexander, Mr. William | male | 26.0 | 0 | 0 | 3474 | 7.8875 |  | S | 0 |
| 897 | 898 | 3 | Lester, Mr. James | male | 39.0 | 0 | 0 | A/4 48871 | 24.1500 |  | S | 0 |
| 898 | 899 | 2 | Slemen, Mr. Richard James | male | 35.0 | 0 | 0 | 28206 | 10.5000 |  | S | 0 |


In [40]:
df.dtypes

PassengerId      int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
Survived         int64
dtype: object

In [46]:
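# Pclass is a class label rather than a quantity, so store it as object
df['Pclass'] = df['Pclass'].astype('object')

In [47]:
# Survived is a 0/1 label, so treat it the same way
df['Survived'] = df['Survived'].astype('object')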

In [48]:
df.dtypes

PassengerId      int64
Pclass          object
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
Survived        object
dtype: object
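
For label-like columns, pandas also offers a dedicated `category` dtype as an alternative to `object`. The sketch below uses a small hypothetical frame (`sample`) in place of the Titanic file, which is not reproduced here.

```python
import pandas as pd

# Small hypothetical frame standing in for the Titanic columns cast above.
sample = pd.DataFrame({"Pclass": [3, 1, 3, 1, 3], "Survived": [0, 1, 1, 1, 0]})

# 'category' stores each distinct label once, plus a compact integer code per row.
sample["Pclass"] = sample["Pclass"].astype("category")
sample["Survived"] = sample["Survived"].astype("category")

print(sample.dtypes)
print(list(sample["Pclass"].cat.categories))
```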

In [50]:
df['Age'].min()

0.42

In [51]:
df['Age'].max()

7000.0

In [52]:
df['Age'].nbytes

7200

In [53]:
np.finfo('float64')

finfo(resolution=1e-15, min=-1.7976931348623157e+308, max=1.7976931348623157e+308, dtype=float64)

In [54]:
np.finfo('float16')

finfo(resolution=0.001, min=-6.55040e+04, max=6.55040e+04, dtype=float16)

In [None]:
# Convert the data type to reduce memory usage

In [55]:
# float16 tops out at about 65504, which still covers the Age range (0.42 to 7000.0)
df['Age'] = df['Age'].astype('float16')

In [56]:
df['Age'].nbytes

1800

In [57]:
df['Age'].min()

0.42

In [58]:
df['Age'].max()

7000.0
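
The saving from 7200 to 1800 bytes comes from storing 2 bytes per value instead of 8, and it has a cost: `float16` keeps only about three significant decimal digits and overflows above 65504. The sketch below illustrates both effects on a small hypothetical array (`ages`), not on the Titanic column itself.

```python
import numpy as np

# Hypothetical values, including the minimum and maximum ages shown above.
ages = np.array([0.42, 22.0, 38.5, 7000.0], dtype="float64")
small = ages.astype("float16")

# Difference between the stored float16 value and the original float64 value.
rounding_error = np.abs(small.astype("float64") - ages)

# Values beyond the float16 maximum (65504) become infinity.
with np.errstate(over="ignore"):
    too_big = np.array([70000.0]).astype("float16")

print(ages.nbytes, small.nbytes)
print(rounding_error)
print(too_big)
```

Checking the column's minimum and maximum against `np.finfo`, as done above, guards against overflow. The rounding error is worth checking separately when the decimals matter.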