# Data Parsing on .cpu Extension Files

# Table of Contents
1. [A CPU File](#a-cpu-file)
2. [Problem Statement](#problem-statement)
3. [Open, Parse & Access Files](#open,-parse-&-access-files)
4. [Data Cleaning](#data-cleaning)
5. [Parse files with extended lengths](#parse-files-with-extended-lengths)
6. [Note:](#note:)

### A CPU File

Files with the .cpu extension are known as SiSoftware Sandra files. A .cpu file is a special file format and should only be edited and saved with the appropriate software.

.cpu files are commonly used by Sandra, a software tool for the Microsoft Windows operating system that lets users test performance and get information about their computer. [Read more](https://filext.com/file-extension/CPU).

### Problem Statement

Although .cpu files may be available, opening them and accessing the data they store can be difficult without specialized software. Very few or no applications (depending on your location or country) can open a .cpu file and access its content.

This notebook uses the data parsing capabilities of Python libraries to:
* access
* open
* clean and
* write the data contained in several .cpu files to a readable format such as CSV.

### Open, Parse & Access Files

In [1]:
#Import required libraries
import pandas as pd
import os
import regex as re
import numpy as np
from io import BytesIO
import dask.dataframe as dd


In [2]:
#Assign file directory to variable pw
pw = "/Users/elizabethofulue/Downloads/Inventory/"

In [3]:
#Create function to access .cpu extension files

def access_file(p):
    
    #Use the os library to list the files in directory p
    files = os.listdir(p)
    
    #Keep only the files with the .cpu extension
    fx = [f for f in files if f.endswith('.cpu')]
    return fx

In [None]:
#Apply the access_file function to pw
cp = access_file(pw)
cp
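
As a cross-check on the file listing step, here is a minimal sketch of the same idea using `pathlib`. The function name `list_cpu_files` and the sample file names are hypothetical and only serve the illustration. Matching on the full extension, rather than on the last three characters of the name, avoids picking up a file such as `mycpu`.

```python
import tempfile
from pathlib import Path


def list_cpu_files(directory):
    """Return the sorted names of files in `directory` ending in .cpu (any case)."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".cpu"
    )


# Demonstration on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.cpu", "B.CPU", "notes.txt", "mycpu"]:
        (Path(tmp) / name).write_text("", encoding="utf-16")
    found = list_cpu_files(tmp)

print(found)
```

Lower-casing the suffix is a design choice for directories where the extension case is inconsistent; drop `.lower()` if only lowercase `.cpu` files should match.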

In [5]:
#Create a function with one parameter to parse the contents of each .cpu file into a dataframe

def open_files(f):
    #Create an empty list
    line = []
    #Loop through and open every file in f
    for filename in f:
        #Name the file handle fh so it does not shadow the re module imported above
        with open(pw+filename, 'r', encoding='utf16') as fh:
            #Read the lines of the file and append them to the list as one entry,
            #so that each file becomes a row and each of its lines a column
            line.append(fh.readlines())
    #Create dataframe from list
    ct = pd.DataFrame(line)
    return ct

In [None]:
#Apply open_files function to the list of .cpu files
df = open_files(cp)
df.head()
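
To make the resulting layout easier to picture, the sketch below writes two made-up UTF-16 files to a temporary directory and reads them back in the same way. The file names and contents are invented for the illustration. It suggests why some columns end up mostly null: a file with fewer lines than the longest one is padded with missing values.

```python
import os
import tempfile

import pandas as pd

# Made-up file contents: one file with two lines, one with a single line
samples = {"one.cpu": '"a;b"\n"c;d"\n', "two.cpu": '"e;f"\n'}

with tempfile.TemporaryDirectory() as tmp:
    rows = []
    for name, text in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "w", encoding="utf-16") as fh:
            fh.write(text)
        with open(path, "r", encoding="utf-16") as fh:
            # splitlines() drops the trailing newline that readlines() keeps
            rows.append(fh.read().splitlines())

# One row per file, one column per line; short files are padded with nulls
demo = pd.DataFrame(rows)
print(demo.shape)
print(demo.isna().sum())
```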

In [7]:
#Review info on the values of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 528 entries, 0 to 527
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       528 non-null    object
 1   1       527 non-null    object
 2   2       10 non-null     object
 3   3       3 non-null      object
 4   4       3 non-null      object
dtypes: object(5)
memory usage: 20.8+ KB


In [8]:
#Review the shape of the dataframe
df.shape

(528, 5)

In [9]:
#Check dataframe for null values
df.isna().sum()

0      0
1      1
2    518
3    525
4    525
dtype: int64

The following were observed from the dataframe:
* Although there are 5 columns, some of them have a large number of null rows. The dataframe will be split by column, and each column will be cleaned individually and then grouped, to account for the columns that are mostly null.
* The values in each column are enclosed in double quotes (**""**). These will be removed.
* Each row in a column contains more than one value, separated by **;**. Each row will be split on the separator and expanded into the columns of a new dataframe.
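
The last two observations can be sketched on a toy dataframe. The values below are made up and only mimic the quoted, semicolon-separated shape described above; this is an illustration under that assumption, not the actual file content.

```python
import pandas as pd

# Made-up rows in the same shape as the raw data: quoted and ';'-separated
raw = pd.DataFrame({0: ['"id1;x;y"', '"id2;z;w"']})

# Remove the literal double quotes (regex=False treats '"' as plain text)
clean = raw.apply(lambda col: col.str.replace('"', "", regex=False))

# Split the single column into several on the ';' separator
parts = clean[0].str.split(";", expand=True)
parts = parts.rename(columns={0: "id"})
print(parts)
```

With `expand=True`, each piece lands in its own integer-labelled column, which is why the cleaning cells refer to columns by number.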

### Data Cleaning

In [None]:
#Strip dataframe of double quotes
dfa = df.apply(lambda x: x.str.replace('"', ''), axis = 0)
dfa

In [None]:
dfa[2].value_counts()

In [None]:
#Expand column 1
dof = dfa[0].str.split(';', expand = True)
dof.rename(columns={0:'id'}, inplace=True)
dof.drop(6, axis = 1, inplace=True)
dof

In [None]:
#Expand column 2
dif = dfa[1].str.split(';', expand = True)
dif.rename(columns={21:'id'}, inplace=True)
dif.drop([22,8], axis = 1, inplace=True)
dif

In [14]:
#Merge both dataframes to create the first master dataframe (mdf)
mdf = dof.merge(dif, on='id')


In [15]:
#Review the shape of new master dataframe
mdf.shape

(527, 26)

In [16]:
#Drop unwanted columns
mdf.drop(['3_x', '5_x'], axis = 1, inplace=True)
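
The column names `3_x` and `5_x` come from the merge: when both frames carry a non-key column with the same label, pandas appends the default suffixes `_x` (left) and `_y` (right). A minimal sketch with made-up frames that share the key `id` and the integer label `1`:

```python
import pandas as pd

# Made-up frames that share the key 'id' and the integer column label 1
left = pd.DataFrame({"id": ["a", "b"], 1: ["L1", "L2"]})
right = pd.DataFrame({1: ["R1", "R2"], "id": ["a", "b"]})

# Overlapping non-key labels are renamed with the default suffixes
merged = left.merge(right, on="id")
print(list(merged.columns))
```

Note that the suffixed labels are strings, so they are dropped by name in quotes rather than by integer.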

In [None]:
#Review the first 5 rows
mdf.head()


In [18]:
#  Write mdf to csv file
mdf.to_csv('cpu.csv')

### Parse files with extended lengths

In [None]:
#Expand column 3
dtf = dfa[2].str.split(';', expand = True)
dtf.rename(columns={21:'id'}, inplace=True)
dtf.drop([22,8], axis = 1, inplace=True)
dtf.dropna(axis=0, inplace=True)
dtf

In [None]:
#Expand column 4
dff = dfa[3].str.split(';', expand = True)
dff.rename(columns={21:'id'}, inplace=True)
dff.drop([22,8], axis = 1, inplace=True)
dff.dropna(axis=0, inplace=True)
dff

In [None]:
#Append dataframes
#DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
amdf = pd.concat([dtf, dff])
amdf

In [22]:
# Write amdf to csv file
amdf.to_csv('cpu2.csv')
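
In pandas 2.0 and later, `DataFrame.append` is no longer available, and `pd.concat` is the usual way to stack frames. The sketch below uses two made-up frames to show the effect of `ignore_index=True`, which may be worth considering before writing to CSV because the stacked frames otherwise keep their original, possibly repeated, row labels.

```python
import pandas as pd

# Made-up frames with the same columns
first = pd.DataFrame({"id": ["a"], 0: ["x"]})
second = pd.DataFrame({"id": ["b", "c"], 0: ["y", "z"]})

# ignore_index=True renumbers the rows from 0 instead of keeping 0, 0, 1
stacked = pd.concat([first, second], ignore_index=True)
print(stacked)
```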

### Note:
Not every cleaning step will apply to the content of every .cpu file. However, the code used to open and parse the files stays the same for most of them.