## Problem Statement
For this project you have 4 files containing information about persons.

The files are:

- `personal_info.csv` - personal information such as name, gender, etc. (one row per person)
- `vehicles.csv` - what vehicle people own (one row per person)
- `employment.csv` - where a person is employed (one row per person)
- `update_status.csv` - when the person's data was created and last updated

Each file contains a key, SSN, which uniquely identifies a person.

This key is present in all four files.

You are guaranteed that the same SSN value is present in every file, and that it only appears once per file.

In addition, the files are all sorted by SSN, i.e. the SSN values appear in the same order in each file.

### Goal 1
Your first task is to create an iterator for each of the four files that yields cleaned-up data of the correct type (e.g. string, int, date), with each row represented by a named tuple.

For now these four iterators are just separate, independent iterators.
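
As an illustration of the idea, here is a minimal sketch that pairs each column with a converter callable inside a generator function. It is not the implementation used in the cells below: `typed_rows`, `parse_ssn`, and the in-memory sample are hypothetical names introduced here, and the sketch assumes the header row contains valid Python identifiers.

```python
import csv
import io
from collections import namedtuple

SSN = namedtuple('SSN', 'area group serial')

def parse_ssn(value):
    """Split an 'AAA-GG-SSSS' string into its three parts (kept as strings)."""
    return SSN(*value.split('-'))

def typed_rows(lines, type_name, converters):
    """Lazily yield one named tuple per CSV row, applying one converter per column."""
    reader = csv.reader(lines)
    headers = next(reader)  # first row is assumed to hold the column names
    Row = namedtuple(type_name, headers)
    for raw in reader:
        yield Row(*(convert(value) for convert, value in zip(converters, raw)))

# Hypothetical two-row sample standing in for vehicles.csv
sample = io.StringIO(
    "ssn,vehicle_make,vehicle_model,model_year\n"
    "100-53-9824,Oldsmobile,Bravada,1993\n"
    "101-71-4702,Ford,Mustang,1997\n"
)
vehicles = typed_rows(sample, 'VehicleInfo', [parse_ssn, str, str, int])
first = next(vehicles)
print(first.vehicle_make, first.model_year)
```

Because `typed_rows` is a generator function, nothing is read or converted until a row is requested, which keeps the four iterators independent and cheap to create.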

In [1]:
# Standard Library Imports
import csv
import datetime
from collections import namedtuple, Counter
from itertools import islice
from collections.abc import Iterator, Iterable

In [2]:
# NamedTuples used for casting data
SSN = namedtuple('SSN', "AreaNumber GroupNumber SerialNumber")
Date = namedtuple('Date', "month day year")
DateTime = namedtuple('DateTime', "Year Month Day Hour Minute Second")

In [3]:
# Casted Data Format
personal_info_data_types = ['SSN', 'STRING', 'STRING', 'STRING', 'STRING']
vehicle_info_data_types = ['SSN', 'STRING', 'STRING', 'INT']
employment_info_data_types = ['STRING', 'STRING', 'STRING', 'SSN']
update_info_data_types = ['SSN', 'DateTime', 'DateTime']

In [4]:
class DataIterator:
    def __init__(self, fname, data_category, expected_data_types):
        self._fname = fname
        self._f = None
        self.headers = None
        self._data_category = data_category
        self._namedtuple = None
        self.expected_data_types = expected_data_types

        # Read and cast the data in proper namedtuple format
        self._f = DataIterator.read_file(self._fname)
        self.headers = next(self._f)
        self._namedtuple = namedtuple(self._data_category, self.headers)
        self.casted_data = (DataIterator.cast_row(self._namedtuple, row, self.expected_data_types) for row in self._f)
    
    def __iter__(self):
        return self
    
    def __next__(self):
        return next(self.casted_data)

    def __enter__(self):
        return self
    
    def __exit__(self, exc_type, exc_value, exc_tb):
        return False
    
    @staticmethod
    def read_file(filename):
        """
        Function to read the csv file
        :param filename: Name of the file
        :return: Row from the file
        """
        with open(filename) as file:
            rows = csv.reader(file, delimiter=',', quotechar='"')
            yield from rows

    @staticmethod
    def cast_row(named_tuple, data_row, data_types):
        """
        Function to return the casted value of the data row
        :param named_tuple: namedtuple class in which the output is expected
        :param data_row: Input data row
        :param data_types: Expected data type of each column, in column order
        :return: Data row in the converted format
        """
        # Pair each column's expected type with its raw value and cast lazily
        casted_data = (DataIterator.cast(data_type, value) for data_type, value in zip(data_types, data_row))

        return named_tuple(*casted_data)

    @staticmethod
    def cast(data_type, value):
        """
        Function to cast appropriate data type for the input value
        :param data_type: Expected data type
        :param value: Input value
        :return: Converted value in the data_type
        """
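        if data_type == 'STRING':
            return str(value)
        elif data_type == 'INT':
            return int(value)
        elif data_type == 'DATE':
            # Split on '/' into (month, day, year); parts stay as strings
            value = value.split('/')
            return Date(*value)
        elif data_type == 'SSN':
            # Split on '-' into (AreaNumber, GroupNumber, SerialNumber); parts stay as strings
            value = value.split('-')
            return SSN(*value)
        elif data_type == 'DateTime':
            # Timestamps look like '2016-01-24T21:19:30Z'
            _date, _time = value.split('T')
            _date = _date.split('-')
            _time_data, _ = _time.split('Z')
            _time = _time_data.split(':')
            _datetime_date = _date + _time
            _format = [int(element) for element in _datetime_date]
            return datetime.datetime(*_format)
        else:
            # Fail loudly instead of silently returning None
            raise ValueError(f"Unsupported data type: {data_type!r}")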

In [5]:
# Create Iterator for all four files
personal_info_iterator = DataIterator('personal_info.csv', 'PersonalInfo', personal_info_data_types)
vehicle_info_iterator = DataIterator('vehicles.csv', 'VehicleInfo', vehicle_info_data_types)
employment_info_iterator = DataIterator('employment.csv', 'EmploymentInfo', employment_info_data_types)
update_info_iterator = DataIterator('update_status.csv', 'UpdateInfo', update_info_data_types)

# Check if instances are Iterators
print(f"Is `personal_info_iterator` an Iterator: {isinstance(personal_info_iterator, Iterator)}")
print(f"Is `vehicle_info_iterator` an Iterator: {isinstance(vehicle_info_iterator, Iterator)}")
print(f"Is `employment_info_iterator` an Iterator: {isinstance(employment_info_iterator, Iterator)}")
print(f"Is `update_info_iterator` an Iterator: {isinstance(update_info_iterator, Iterator)}")
print("  ")

# Print Sample Data from each Iterator
[print(row) for row in islice(personal_info_iterator, 5)]
print("  ")
[print(row) for row in islice(vehicle_info_iterator, 5)]
print("  ")
[print(row) for row in islice(employment_info_iterator, 5)]
print("  ")
[print(row) for row in islice(update_info_iterator, 5)]

Is `personal_info_iterator` an Iterator: True
Is `vehicle_info_iterator` an Iterator: True
Is `employment_info_iterator` an Iterator: True
Is `update_info_iterator` an Iterator: True
  
PersonalInfo(ssn=SSN(AreaNumber='100', GroupNumber='53', SerialNumber='9824'), first_name='Sebastiano', last_name='Tester', gender='Male', language='Icelandic')
PersonalInfo(ssn=SSN(AreaNumber='101', GroupNumber='71', SerialNumber='4702'), first_name='Cayla', last_name='MacDonagh', gender='Female', language='Lao')
PersonalInfo(ssn=SSN(AreaNumber='101', GroupNumber='84', SerialNumber='0356'), first_name='Nomi', last_name='Lipprose', gender='Female', language='Yiddish')
PersonalInfo(ssn=SSN(AreaNumber='104', GroupNumber='22', SerialNumber='0928'), first_name='Justinian', last_name='Kunzelmann', gender='Male', language='Dhivehi')
PersonalInfo(ssn=SSN(AreaNumber='104', GroupNumber='84', SerialNumber='7144'), first_name='Claudianus', last_name='Brixey', gender='Male', language='Afrikaans')
  
VehicleInfo(ssn

[None, None, None, None, None]

---

### Goal 2
Create a single iterable that combines all the columns from all the iterators.

The iterable should yield named tuples containing all the columns. Make sure that the SSNs across the files match!

All the files are guaranteed to be in SSN sort order, every SSN is unique, and every SSN appears in every file.

Make sure the SSN is not repeated 4 times - one time per row is enough!
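
One possible shape for this step is sketched below: walk the iterators in lockstep with `zip`, verify that the key matches on every row, and merge the fields through a dictionary so the key is stored only once. The function `merge_rows`, the `Person` and `Vehicle` tuples, and the sample values are hypothetical and exist only for illustration. It is written as a generator function; wrapping it in a class with `__iter__` would make it a re-iterable.

```python
from collections import namedtuple

def merge_rows(*iterators, key='ssn'):
    """Yield one combined named tuple per person, keeping the key field only once."""
    Combined = None
    for rows in zip(*iterators):
        key_value = getattr(rows[0], key)
        # Fail loudly if the inputs ever drift out of alignment
        if any(getattr(row, key) != key_value for row in rows[1:]):
            raise ValueError(f"Mismatched {key}: {rows}")
        merged = {}
        for row in rows:
            # The repeated key maps to the same dict entry, so it appears once
            merged.update(row._asdict())
        if Combined is None:
            # Build the combined type from the first merged row's field order
            Combined = namedtuple('Combined', list(merged))
        yield Combined(**merged)

# Hypothetical miniature inputs
Person = namedtuple('Person', 'ssn first_name')
Vehicle = namedtuple('Vehicle', 'ssn vehicle_make')
people = iter([Person('100-53-9824', 'Sebastiano'), Person('101-71-4702', 'Cayla')])
cars = iter([Vehicle('100-53-9824', 'Oldsmobile'), Vehicle('101-71-4702', 'Ford')])

combined = list(merge_rows(people, cars))
print(combined[0])
```

Building the field list from a dictionary preserves the column order of the inputs, which a `set` does not.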

In [6]:
# Create Iterator for all four files
personal_info_iterator = DataIterator('personal_info.csv', 'PersonalInfo', personal_info_data_types)
vehicle_info_iterator = DataIterator('vehicles.csv', 'VehicleInfo', vehicle_info_data_types)
employment_info_iterator = DataIterator('employment.csv', 'EmploymentInfo', employment_info_data_types)
update_info_iterator = DataIterator('update_status.csv', 'UpdateInfo', update_info_data_types)

# Extract each row from the Iterator
combined_fields = []
row1 = next(personal_info_iterator)
row2 = next(vehicle_info_iterator)
row3 = next(employment_info_iterator)
row4 = next(update_info_iterator)

# Collect all the field names from the namedtuple data
combined_fields.extend(row1._fields)
combined_fields.extend(row2._fields)
combined_fields.extend(row3._fields)
combined_fields.extend(row4._fields)

# Remove the repeated fields (a set is unordered, so the field order of CombinedData can vary between runs)
combined_fields = set(combined_fields)
print(combined_fields)

# Create NamedTuple for Combined Data
CombinedData = namedtuple('CombinedData', combined_fields)

{'last_name', 'language', 'employer', 'last_updated', 'gender', 'department', 'model_year', 'employee_id', 'ssn', 'vehicle_make', 'created', 'first_name', 'vehicle_model'}


In [7]:
# Create Iterator for all four files
with DataIterator('personal_info.csv', 'PersonalInfo', personal_info_data_types) as personal_info_iterator:
    with DataIterator('vehicles.csv', 'VehicleInfo', vehicle_info_data_types) as vehicle_info_iterator:
        with DataIterator('employment.csv', 'EmploymentInfo', employment_info_data_types) as employment_info_iterator:
            with DataIterator('update_status.csv', 'UpdateInfo', update_info_data_types) as update_info_iterator:
                # List to store the combined data
                combined_data = []

                # Iterate over each row in all the data
                for personal_data, vehicle_data, employment_data, update_data in zip(personal_info_iterator, vehicle_info_iterator, employment_info_iterator, update_info_iterator):
                    # Dictionary to store all the data
                    temp = dict()

                    # SSN of a person from personal information
                    _ssn = personal_data.ssn

                    # Store each type of a data in a separate temporary dictionary
                    temp_1 = {field: getattr(personal_data, field) for field in personal_data._fields}
                    temp_2 = {field: getattr(vehicle_data, field) for field in vehicle_data._fields if vehicle_data.ssn == _ssn}
                    temp_3 = {field: getattr(employment_data, field) for field in employment_data._fields if employment_data.ssn == _ssn}
                    temp_4 = {field: getattr(update_data, field) for field in update_data._fields if update_data.ssn == _ssn}
                    
                    # Combine 4 different dictionaries into 1 dictionary
                    for data in (temp_1, temp_2, temp_3, temp_4):
                        temp.update(data)

                    # Convert the data into NamedTuple
                    combined_data.append(CombinedData(**temp))

                print(len(combined_data))

1000


In [8]:
# Create an Iterable whose __iter__ returns a fresh generator over the first n records
class DataIterable:
    def __init__(self, n):
        self._n = n
        self._dataset = combined_data

    def __len__(self):
        return len(self._dataset)

    def __iter__(self):
        return DataIterable.fetch_data(self._dataset, self._n)

    @staticmethod
    def fetch_data(dataset, n):
        # Read from the dataset passed in rather than the module-level list
        for i in range(n):
            yield dataset[i]

In [9]:
combined_iterable = DataIterable(5)
print(type(combined_iterable))
print(f"Is `combined_iterable` object an Iterable: {isinstance(combined_iterable, Iterable)}")

<class '__main__.DataIterable'>
Is `combined_iterable` object an Iterable: True


In [10]:
[data for data in combined_iterable]

[CombinedData(last_name='Tester', language='Icelandic', employer='Stiedemann-Bailey', last_updated=datetime.datetime(2017, 10, 7, 0, 14, 42), gender='Male', department='Research and Development', model_year=1993, employee_id='29-0890771', ssn=SSN(AreaNumber='100', GroupNumber='53', SerialNumber='9824'), vehicle_make='Oldsmobile', created=datetime.datetime(2016, 1, 24, 21, 19, 30), first_name='Sebastiano', vehicle_model='Bravada'),
 CombinedData(last_name='MacDonagh', language='Lao', employer='Nicolas and Sons', last_updated=datetime.datetime(2017, 1, 23, 11, 23, 17), gender='Female', department='Sales', model_year=1997, employee_id='41-6841359', ssn=SSN(AreaNumber='101', GroupNumber='71', SerialNumber='4702'), vehicle_make='Ford', created=datetime.datetime(2016, 1, 27, 4, 32, 57), first_name='Cayla', vehicle_model='Mustang'),
 CombinedData(last_name='Lipprose', language='Yiddish', employer='Connelly Group', last_updated=datetime.datetime(2017, 10, 4, 11, 21, 30), gender='Female', depar

In [11]:
[data for data in combined_iterable]

[CombinedData(last_name='Tester', language='Icelandic', employer='Stiedemann-Bailey', last_updated=datetime.datetime(2017, 10, 7, 0, 14, 42), gender='Male', department='Research and Development', model_year=1993, employee_id='29-0890771', ssn=SSN(AreaNumber='100', GroupNumber='53', SerialNumber='9824'), vehicle_make='Oldsmobile', created=datetime.datetime(2016, 1, 24, 21, 19, 30), first_name='Sebastiano', vehicle_model='Bravada'),
 CombinedData(last_name='MacDonagh', language='Lao', employer='Nicolas and Sons', last_updated=datetime.datetime(2017, 1, 23, 11, 23, 17), gender='Female', department='Sales', model_year=1997, employee_id='41-6841359', ssn=SSN(AreaNumber='101', GroupNumber='71', SerialNumber='4702'), vehicle_make='Ford', created=datetime.datetime(2016, 1, 27, 4, 32, 57), first_name='Cayla', vehicle_model='Mustang'),
 CombinedData(last_name='Lipprose', language='Yiddish', employer='Connelly Group', last_updated=datetime.datetime(2017, 10, 4, 11, 21, 30), gender='Female', depar

---

### Goal 3
Next, you want to identify any stale records, where stale simply means the record has not been updated since 3/1/2017 (i.e. last update date < 3/1/2017). Create an iterator that only contains current records (i.e. not stale) based on the `last_updated` field from the `update_status.csv` file.
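
The filter itself can be expressed lazily with a generator expression that compares each record's timestamp against the cutoff. The sketch below assumes records expose a `last_updated` attribute holding a `datetime.datetime`; `current_only`, `STALE_CUTOFF`, and the two-field `Record` tuple are hypothetical names used only for this example.

```python
import datetime
from collections import namedtuple

STALE_CUTOFF = datetime.datetime(2017, 3, 1)

def current_only(records, cutoff=STALE_CUTOFF):
    """Lazily yield records whose last_updated is on or after the cutoff."""
    return (record for record in records if record.last_updated >= cutoff)

# Hypothetical reduced records; only the fields needed for the filter
Record = namedtuple('Record', 'first_name last_updated')
sample = [
    Record('Sebastiano', datetime.datetime(2017, 10, 7, 0, 14, 42)),
    Record('Cayla', datetime.datetime(2017, 1, 23, 11, 23, 17)),
]
current = current_only(sample)
print([record.first_name for record in current])
```

The returned generator is an iterator, so it can be consumed only once; call `current_only` again to make a second pass.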

In [12]:
total_dataset = DataIterable(len(combined_data))

In [13]:
# Lists to store the stale records and their indexes, plus the staleness cutoff
stale_records = []
stale_indexes = []
stale_threshold = datetime.datetime(2017, 3, 1)

In [14]:
# Collect all the stale records and their indexes
for index, data in enumerate(total_dataset):
    if data.last_updated < stale_threshold:
        stale_records.append(data)
        stale_indexes.append(index)

print(f"Dataset Size: {len(total_dataset)}")

Dataset Size: 1000


In [15]:
# Create current data by removing the stale data from the combined dataset
print(f"Total Stale Records: {len(stale_indexes)}")

# Compare timestamps directly instead of scanning the stale_records list for every row
current_records = (data for data in total_dataset if data.last_updated >= stale_threshold)

print(f"Is current record an Iterator: {isinstance(current_records, Iterator)}")

Total Stale Records: 129
Is current record an Iterator: True


---

### Goal 4
Find the largest group of car makes for each gender.

Possibly more than one such group per gender exists (equal sizes).
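
A compact way to handle ties is to count makes per gender and then keep every make whose count equals the maximum. The following sketch assumes records expose `gender` and `vehicle_make` attributes; `largest_make_groups`, the `Owner` tuple, and the sample data are hypothetical and not taken from the dataset.

```python
from collections import Counter, defaultdict, namedtuple

def largest_make_groups(records):
    """Map each gender to (list of makes tied for the top count, that count)."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record.gender][record.vehicle_make] += 1
    result = {}
    for gender, counter in counts.items():
        top = max(counter.values())
        # Keep every make that reaches the top count, not just the first one
        result[gender] = ([make for make, n in counter.items() if n == top], top)
    return result

# Hypothetical sample with a tie among the female owners
Owner = namedtuple('Owner', 'gender vehicle_make')
sample = [
    Owner('Male', 'Ford'), Owner('Male', 'Ford'), Owner('Male', 'Audi'),
    Owner('Female', 'Ford'), Owner('Female', 'Chevrolet'),
]
groups = largest_make_groups(sample)
print(groups)
```

Grouping by whatever gender values appear in the data avoids repeating the counting logic once per gender and makes a single pass over the records, which matters when the input is a one-shot iterator.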

In [16]:
# Largest Group of car for each gender in total dataset
vehicle_make_data_male = []
vehicle_make_data_female = []

for data in total_dataset:
    if data.gender == 'Male':
        vehicle_make_data_male.append(data.vehicle_make)
    elif data.gender == "Female":
        vehicle_make_data_female.append(data.vehicle_make)

male_counter = Counter(vehicle_make_data_male)
max_count_vehicle_make_male = max(male_counter.values())
largest_group_of_car_for_male = [key for key, value in male_counter.items() if value == max_count_vehicle_make_male]
print(f"Largest Group of Car for Male is {largest_group_of_car_for_male} with a number of {max_count_vehicle_make_male}")

female_counter = Counter(vehicle_make_data_female)
max_count_vehicle_make_female = max(female_counter.values())
largest_group_of_car_for_female = [key for key, value in female_counter.items() if value == max_count_vehicle_make_female]
print(f"Largest Group of Car for Female is {largest_group_of_car_for_female} with a number of {max_count_vehicle_make_female}")

Largest Group of Car for Male is ['Ford'] with a number of 44
Largest Group of Car for Female is ['Ford', 'Chevrolet'] with a number of 48


In [17]:
# Largest Group of car for each gender in current dataset
vehicle_make_data_male = []
vehicle_make_data_female = []

for data in current_records:
    if data.gender == 'Male':
        vehicle_make_data_male.append(data.vehicle_make)
    elif data.gender == "Female":
        vehicle_make_data_female.append(data.vehicle_make)

male_counter = Counter(vehicle_make_data_male)
max_count_vehicle_make_male = max(male_counter.values())
largest_group_of_car_for_male = [key for key, value in male_counter.items() if value == max_count_vehicle_make_male]
print(f"Largest Group of Car for Male is {largest_group_of_car_for_male} with a number of {max_count_vehicle_make_male}")

female_counter = Counter(vehicle_make_data_female)
max_count_vehicle_make_female = max(female_counter.values())
largest_group_of_car_for_female = [key for key, value in female_counter.items() if value == max_count_vehicle_make_female]
print(f"Largest Group of Car for Female is {largest_group_of_car_for_female} with a number of {max_count_vehicle_make_female}")

Largest Group of Car for Male is ['Ford'] with a number of 40
Largest Group of Car for Female is ['Chevrolet', 'Ford'] with a number of 42
