# Email Domain and Validity Analysis

## Project Overview

This project addresses a common issue faced by organizations: identifying junk emails and domains that cause memory issues on email servers, which results in missing valid emails from genuine senders. The goal of this use case is to create a Python-based solution to identify and classify valid and invalid emails and domains from a mixed dataset.

## Use Case

The organization is encountering server memory problems due to an influx of emails from junk domains, causing critical emails to be missed. The objective is to create a temporary dataset containing both valid and invalid email addresses, then develop a Python script to analyze and resolve the following:

## Solution


1. **Data Preparation**:
   - Create a dataset of email addresses, including a mix of valid and invalid email addresses and domains.
   - Store this data in a file format that is easy to process (e.g., JSON or TXT).

* Code for this task has been written below.

In [26]:
# Import required module
import random 
import string 
import json

# Create sample firstname, lastname, and email. So the email generat
first_names = ["John", "Jane", "Michael", "Emily", "Christopher", "Olivia", "William", "Ava", "David", "Sophia", "Matthew", "Isabella", "James", "Mia", "Daniel", "Charlotte", "Joshua", "Amelia", "Christopher", "Evelyn", "Andrew", "Abigail", "Joseph", "Harper", "Thomas", "Emma", "Charles", "Addison", "George", "Brooklyn",'John', 'Jane', 'Mike', 'Sarah', 'Robert', 'Emily', 'Daniel', 'Sophia', 'David', 'Olivia']

last_names = ["Smith", "Johnson", "Williams", "Brown", "Jones", "Miller", "Davis", "Garcia", "Rodriguez", "Wilson", "Martinez", "Anderson", "Taylor", "Thomas", "Moore", "Jackson", "White", "Lee", "Harris", "Clark",'Smith', 'Johnson', 'Brown', 'Williams', 'Jones', 'Miller', 'Davis', 'Garcia', 'Rodriguez', 'Wilson']

domains = ["amazon.in","amazon.com","siemens.com","oracle.com","google.co.in","gmail.com", 'yahoo.com', 'outlook.com', 'hotmail.com', 'mail.com',"xyz.com","instant-help-tech.com","workfromhome-careers.com","apply-remotely-now.com","amazon-order-status-check.com","account-update-now.info","cheapmedications4you.com","netflix-billing-support.com"]

### Creating dataset of email address with invocation of funtion "generate_random_emails" and saving data as json file

In [28]:
# Function to generate random emails
def generate_random_emails(num_emails=5000):
    emails = []
    
    for _ in range(num_emails):
        fname = random.choice(first_names)
        lname = random.choice(last_names)
        
        # Randomly decide the format of the email
        email_format = random.choice([
            f"{fname.lower()}",
            f"{lname.lower()}",
            f"{fname.lower()}.{lname.lower()}",
            f"{fname.lower()}{random.randint(1, 31)}",  # Date
            f"{fname.lower()}{random.choice(['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'])}",  # Month
            f"{fname.lower()}{random.randint(1900, 2024)}",  # Year
            f"{fname.lower()}.{lname.lower()}{random.randint(1, 31)}",  # fname.lastname + date
            f"{fname.lower()}.{lname.lower()}{random.randint(1900, 2024)}",  # fname.lastname + year
            f"{fname.lower()}.{lname.lower()}{random.choice(['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'])}"  # fname.lastname + month
        ])
        
        # Randomly pick a domain
        domain = random.choice(domains)
        
        # Construct the email and add it to the list
        email = f"{email_format}@{domain}"
        emails.append(email)
    
    return emails

# Example of generating 5000 emails
random_emails = generate_random_emails()

with open("random_email.json","w") as file:
    file.write(json.dumps(random_emails, indent=4))


# This code automatically create json file with name "random_email.json" and write random email in it.

2. **Analysis and Processing**:
   - Use Python to read the dataset and process the email addresses.
   - Leverage regular expressions to distinguish between valid and invalid emails based on a typical email format (e.g., `username@domain.extension`).
   - Separate the domains from the email addresses and categorize them into valid and invalid domains.
   - Create a mapping between valid domains and their respective valid email addresses.

In [1]:
import json

In [2]:
# Create Tuple of valid email domain (Get this from company)
valid_email_domains = ("amazon.in","amazon.com","siemens.com","oracle.com","google.co.in","gmail.com", 
                       'yahoo.com', 'outlook.com', 'hotmail.com', 'mail.com')

In [3]:
# Read exported file(json, txt, xml, etc)
def read_file(file_path):
    file_extensions = (".json", ".txt")
    email_list =[]
    msg = None
    if file_path.endswith(file_extensions):
        with open(file_path, 'r') as file:
            if(file_path.endswith('.json')):
                email_list = json.load(file)
            else:
                email_list = file.read().split(',')
        msg = "Valid File Extension"
    else:
        msg = "Invalid file extension."

    return email_list, msg

In [4]:
# Path of randomly generated file stored
email_list, msg = read_file('E:/Python-learning/Python_Learning/data/random_email.json')


In [9]:
# Distinguish between valid and invalid emails based on a typical email format
def email_detaction_analysis(valid_email_domains, email_list):
    domain_email_mapper = {}
    valid_emails_list = []
    invalid_email_list = []
    temp_valid_emails = []

    # Check each email against the valid domains
    for email in email_list:
        valid = False  # Flag to track if an email is valid
        for domain in valid_email_domains:
            if email.endswith(domain):
                valid = True
                if email not in temp_valid_emails:
                    temp_valid_emails.append(email)
                break  # No need to check further domains if valid

        if not valid:  # If not valid, add to invalid list
            if email not in invalid_email_list:
                invalid_email_list.append(email)

        valid_emails_list.extend(temp_valid_emails)

        # Initialize the domain in the mapper if not already present
        if domain not in domain_email_mapper:
            domain_email_mapper[domain] = []
        domain_email_mapper[domain].extend(temp_valid_emails)

        # Clear temp_valid_emails for the next iteration
        temp_valid_emails.clear()
        
    return valid_emails_list, invalid_email_list, domain_email_mapper

# Calling Function
valid_emails_list, invalid_email_list, domain_email_mapper = email_detaction_analysis(valid_email_domains, email_list)


3. **Result and Outcomes**:
    - List of valid emails: A clear list of emails that conform to the expected format and originate from trusted domains.
    - List of valid domains: A list of recognized domains that are identified as legitimate sources of communication.
    - List of invalid emails: A collection of emails that do not follow proper formatting or come from spammy sources.
    - List of invalid domains: A set of domains associated with junk emails or potential threats.
    - Mapping of valid domains to emails: A dictionary or mapping that connects valid domains to the corresponding valid emails.

In [10]:
# List of valid emails: 
print(valid_emails_list[:20])

['daniel.wilson31@oracle.com', 'harper.taylorjun@amazon.com', 'james.johnson2015@hotmail.com', 'christopher19@amazon.in', 'robert@hotmail.com', 'sophia.garcia@amazon.in', 'olivia.brown1919@siemens.com', 'wilson@oracle.com', 'sophia.miller22@hotmail.com', 'joseph.jonessep@hotmail.com', 'daniel28@hotmail.com', 'janeoct@oracle.com', 'evelyn1900@siemens.com', 'amelia.williamsoct@google.co.in', 'jane.rodriguez1943@mail.com', 'emma.wilson@oracle.com', 'ava.garcia1@google.co.in', 'john1915@outlook.com', 'davidmay@amazon.in', 'rodriguez@google.co.in']


In [11]:
# List of valid domains:
valid_domains = list(domain_email_mapper.keys())
print(valid_domains)

['mail.com', 'oracle.com', 'amazon.com', 'hotmail.com', 'amazon.in', 'siemens.com', 'google.co.in', 'outlook.com', 'gmail.com', 'yahoo.com']


In [12]:
# List of invalid emails:
print(invalid_email_list[:20])

['matthew.martinez23@netflix-billing-support.com', 'david13@account-update-now.info', 'georgedec@netflix-billing-support.com', 'brooklyn.smith11@workfromhome-careers.com', 'mia13@instant-help-tech.com', 'christopher5@xyz.com', 'matthewaug@apply-remotely-now.com', 'sophia1931@apply-remotely-now.com', 'christopher.lee1908@instant-help-tech.com', 'charlotte.white@apply-remotely-now.com', 'taylor@netflix-billing-support.com', 'michael.rodriguez@amazon-order-status-check.com', 'thomas@amazon-order-status-check.com', 'olivia@workfromhome-careers.com', 'david.rodriguez1939@amazon-order-status-check.com', 'christopher3@apply-remotely-now.com', 'john.miller28@cheapmedications4you.com', 'robert.smith22@account-update-now.info', 'david@amazon-order-status-check.com', 'taylor@instant-help-tech.com']


In [13]:
# List of invalid domains:
invalid_domain = []
for email in invalid_email_list:
    domain = email.split('@')[1]
    if domain not in invalid_domain:
        invalid_domain.append(domain)

print(invalid_domain)

['netflix-billing-support.com', 'account-update-now.info', 'workfromhome-careers.com', 'instant-help-tech.com', 'xyz.com', 'apply-remotely-now.com', 'amazon-order-status-check.com', 'cheapmedications4you.com']


In [14]:
# Mapping of valid domains to emails:
print(domain_email_mapper)

{'mail.com': ['jane.rodriguez1943@mail.com', 'christopherjul@mail.com', 'robert.garcia@mail.com', 'harper.clark@mail.com', 'clark@mail.com', 'ava1905@mail.com', 'johnson@mail.com', 'emma@mail.com', 'abigail@mail.com', 'david.moorefeb@mail.com', 'joseph.williams26@mail.com', 'emily1964@mail.com', 'mike.andersonnov@mail.com', 'thomas.harris1906@mail.com', 'sophia.johnsonaug@mail.com', 'amelia@mail.com', 'george.mooreoct@mail.com', 'william.johnson1928@mail.com', 'michael.andersonnov@mail.com', 'emily.lee3@mail.com', 'olivia.lee11@mail.com', 'harpernov@mail.com', 'john18@mail.com', 'john.taylorfeb@mail.com', 'robert1910@mail.com', 'rodriguez@mail.com', 'christopheraug@mail.com', 'david1919@mail.com', 'emilyoct@mail.com', 'emily.jackson17@mail.com', 'miller@mail.com', 'sarahmay@mail.com', 'sophia@mail.com', 'daniel10@mail.com', 'olivia.jackson1999@mail.com', 'matthew.martinez8@mail.com', 'sarah.taylor9@mail.com', 'emily12@mail.com', 'emilynov@mail.com', 'emily.davis31@mail.com', 'olivia.wh