# Data Parsing on .services Extension Files

# Table of Contents
1. [A Service File](#a-service-file)
2. [Problem Statement](#problem-statement)
3. [Open, Parse & Access Files](#open-parse--access-files)
4. [Data Cleaning](#data-cleaning)
5. [Writing to CSV](#writing-to-csv)
6. [Note:](#note)

### A Service File

A **SERVICE** file is a service unit file included with systemd, an init (initialization) system used by various Linux distributions to bootstrap user space and manage processes. 

It contains information about how to manage a server application or service. [Read more](https://fileinfo.com/extension/service).


### Problem Statement

Although these service files may be readily available, opening them and reading their contents can be difficult. Few applications, if any (depending on your location or country), can open a service file and expose its contents.

This notebook uses the data parsing capabilities of Python libraries to:
* access,
* open,
* clean, and
* write the data contained in several service files to a readable format such as CSV.

### Open, Parse & Access Files

In [1]:
#Import required libraries
import pandas as pd
import os
import regex as re
import numpy as np

In [2]:
#Assign file directory to variable pw
pw = "/Users/elizabethofulue/Downloads/Inventory/"

In [3]:
#Create function to access extension files of interest

def access_file(p):

    #Use the os library to list the files in directory p
    files = os.listdir(p)

    #Keep only the files with the .services extension
    fa = [s for s in files if s.endswith('.services')]
    return fa

In [None]:
#Apply the access_file function to pw
ad = access_file(pw)
ad
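
The same file listing can also be sketched with `pathlib`, which matches the extension through a glob pattern instead of a string comparison. The function name `list_service_files` is an assumption made for this illustration and is not part of the notebook.

```python
from pathlib import Path

def list_service_files(directory):
    """Return the sorted names of files in `directory` that end in '.services'."""
    # Path.glob matches the pattern against entries directly inside `directory`
    return sorted(p.name for p in Path(directory).glob("*.services") if p.is_file())
```

Calling `list_service_files(pw)` should return the same names as `access_file(pw)`, only sorted alphabetically.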

In [5]:
#Create a function with one variable that reads the lines of each extension file into a dataframe

def open_line(l):
    #Create an empty list to hold the lines of each file
    line = []
    #Loop through and open every file name in l
    for filename in l:
        #The files are UTF-16 encoded; pw is the directory assigned above
        with open(os.path.join(pw, filename), 'r', encoding='utf16') as f:
            #Read every line and append the list of lines for this file,
            #so that each file becomes one row and each line one column
            line.append(f.readlines())
    #Create dataframe from list
    lines_df = pd.DataFrame(line)
    return lines_df


In [6]:
#Apply open_line function to the list of file names
df = open_line(ad)
df.head(1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,321,322,323,324,325,326,327,328,329,330
0,"""BNY-D-3775"";""Cisco AnyConnect Secure Mobility...","""BNY-D-3775"";""AllJoyn Router Service"";""AllJoyn...","""BNY-D-3775"";""Application Layer Gateway Servic...","""BNY-D-3775"";""AMD Crash Defender Service"";""AMD...","""BNY-D-3775"";""Application Identity"";""Applicati...","""BNY-D-3775"";""Application Information"";""Applic...","""BNY-D-3775"";""Application Management"";""Applica...","""BNY-D-3775"";""App Readiness"";""App Readiness"";""...","""BNY-D-3775"";""Microsoft App-V Client"";""Microso...","""BNY-D-3775"";""AppX Deployment Service (AppXSVC...",...,,,,,,,,,,


In [7]:
#Check dataframe for null values
df.isna().sum()

0        1
1        1
2        1
3        1
4        1
      ... 
326    526
327    526
328    526
329    526
330    526
Length: 331, dtype: int64

In [8]:
#Review the shape of the dataframe
df.shape

(527, 331)
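
The null counts above follow from how pandas builds a dataframe from lists of unequal length: shorter rows are padded with missing values. A minimal sketch, using hypothetical line lists in place of real `readlines()` output:

```python
import pandas as pd

# Hypothetical stand-in for readlines() output from two files of different lengths
lines_per_file = [
    ["line 1\n", "line 2\n", "line 3\n"],  # a file with three lines
    ["line 1\n"],                          # a file with one line
]

frame = pd.DataFrame(lines_per_file)
print(frame.shape)                  # (2, 3)
print(frame.isna().sum().tolist())  # [0, 1, 1]
```

The shorter file leaves nulls in every column beyond its last line, which is presumably why the highest-numbered columns in the real data are almost entirely null.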

The following were observed from the dataframe:

* Some columns contain null values, which need to be dropped.
* The values in each column are enclosed in double quotes - **""**. These will be removed.
* Each cell contains more than one value, separated by **;**. Each cell will be split on this separator and the resulting values appended as rows of a new dataframe.
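
To make the last two observations concrete, the cleaning of a single cell can be sketched in plain Python. The helper `split_record` and the sample values are hypothetical and only illustrate the idea. The notebook itself performs these steps on whole columns with pandas.

```python
def split_record(raw_line, sep=";"):
    """Strip the trailing newline and the double quotes, then split on `sep`."""
    cleaned = raw_line.rstrip("\n").replace('"', "")
    return cleaned.split(sep)

# Hypothetical sample line; the field values are made up for illustration
sample = '"HOST-1";"Example Service";"Running"\n'
print(split_record(sample))  # ['HOST-1', 'Example Service', 'Running']
```

Unlike this sketch, the notebook does not strip the newline before splitting, so the newline can survive as a separate value in the expanded dataframe.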

### Data Cleaning

In [None]:
#Strip dataframe of double quotes
dff = df.apply(lambda x: x.str.replace('"', ''), axis = 0)
dff.head()

In [10]:
#Create a function to expand and split values per column in dataframe

def expand_df(df):
    #Collect the split result of each column in a list
    parts = []

    #Loop through column names and values
    for (colname, colval) in df.items():
        #Split each value on ';' into separate columns
        vt = colval.str.split(';', expand=True)
        parts.append(vt)

    #Stack the split results into one dataframe
    #(DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
    sas = pd.concat(parts)

    #Return expanded dataframe
    return sas

In [None]:
#Apply expand_df function to dataframe
dff = expand_df(dff)

#Review the first 5 rows
dff.head()

In [None]:
#Review value counts of the file names 
dff[0].value_counts()

BNY-D-3333    331
ABJ-D-0006    320
BNY-D-3768    320
ABJ-L-0135    316
ABJ-L-0099    316
             ... 
ABJ-S-303     166
ABJ-S-304     164
ABJ-S-306     163
ABJ-S-301     163
ABJ-S-305     156
Name: 0, Length: 526, dtype: int64

In [None]:
#Review info of the dataframe
dff.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 174437 entries, 0 to 526
Data columns (total 11 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   0       154906 non-null  object
 1   1       154906 non-null  object
 2   2       154906 non-null  object
 3   3       154906 non-null  object
 4   4       154906 non-null  object
 5   5       154906 non-null  object
 6   6       154906 non-null  object
 7   7       154906 non-null  object
 8   8       154906 non-null  object
 9   9       154906 non-null  object
 10  10      154906 non-null  object
dtypes: object(11)
memory usage: 16.0+ MB


In [None]:
#Confirm the value of null rows per column
dff.isna().sum()

0     19531
1     19531
2     19531
3     19531
4     19531
5     19531
6     19531
7     19531
8     19531
9     19531
10    19531
dtype: int64

* Although there are **174,437** entries, every column has only **154,906** non-null values. The remaining **19,531** rows are null in every column and will be removed, as they do not contain any relevant data.
* Column 9 is a duplicate of column 0 while column 10 is populated with **\n**. These columns can be removed as they are no longer relevant.
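
The effect of these two clean-up steps can be previewed on a miniature dataframe. The data below is hypothetical and uses 4 columns instead of 11, so columns 2 and 3 stand in for columns 9 and 10.

```python
import pandas as pd

# Hypothetical miniature of the expanded dataframe
toy = pd.DataFrame(
    [
        ["HOST-1", "Service A", "HOST-1", "\n"],
        [None, None, None, None],  # a fully null row
        ["HOST-2", "Service B", "HOST-2", "\n"],
    ]
)

cleaned = (
    toy.dropna(how="all")          # remove rows where every value is null
       .drop(columns=[2, 3])       # drop the duplicate id and the newline column
       .rename(columns={0: "id"})  # give the first column a readable name
       .reset_index(drop=True)
)
print(cleaned.shape)  # (2, 2)
```

Chaining the calls returns a new dataframe at each step, so the original `toy` is left unchanged.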

In [None]:
#Create a function to clean the dataframe
def clean_df(df):

    #Drop null rows
    df = df.dropna()

    #Drop unwanted columns
    df = df.drop([9, 10], axis=1)

    #Rename column 0 to id
    df = df.rename(columns={0: 'id'})

    #Return the full clean dataframe, not only its first 5 rows
    return df

In [None]:
#Apply clean_df function to dff
mdf = clean_df(dff)

#Review the first 5 rows
mdf.head()

### Writing to CSV

In [None]:
#Write clean mdf dataframe to csv file, leaving out the non-unique row index
mdf.to_csv('services.csv', index=False)

### Note:
Not every cleaning step will apply to the contents of every .services file. However, the code used to open and parse the files should remain largely the same across most of them.
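
Because the files shown here are semicolon-separated and double-quoted, the parsing step could alternatively be sketched with the standard-library `csv` module. This is a different technique from the one used in the notebook, and it assumes every file is UTF-16 encoded with the same layout. The function name `read_services_file` is hypothetical.

```python
import csv

def read_services_file(path, encoding="utf-16"):
    """Parse one semicolon-delimited, double-quoted file into a list of rows."""
    # newline="" lets the csv module handle line endings itself
    with open(path, "r", encoding=encoding, newline="") as fh:
        reader = csv.reader(fh, delimiter=";", quotechar='"')
        # Skip blank lines, which csv.reader yields as empty lists
        return [row for row in reader if row]
```

A reader of this kind removes the quotes and splits the fields in one pass, and it keeps a `;` that appears inside a quoted value as part of that value.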