# Defining the problem statement
- Extract, analyse, and visualise the logging statements used in Apache Hadoop

## Caveats in the project
- Highly open-ended task
- The Apache Hadoop project comprises 7 different programming languages
- Inconsistent use of logging statements across the code base
- Logging levels and the context in which they are used


## Approaches Used
- Considered only Java (92.8% of the code base)
- Understand why developers need logging statements in code
- What is the importance of logging levels in a logging statement?
- Which libraries or approaches are available for logging in Java, and which of them are used in this project?
- Distribution of logging statement usage across libraries
- Most popular logging library and logging level
- Dependence of the logging level on
    - the number of lines of code in a file
    - the context in which the log level is used: evaluate in which scenario it was used (by examining the preceding code)
- **Where can we get the logging statements?**
    - Along with the source code, the development history (git log) provides valuable information about logging practices
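
The last point can be sketched in a few lines of Python: scan the patch text of a `git log -p` dump and count the logging calls on added and removed lines, grouped by level. This is a minimal sketch, not a definitive extractor. The function name `count_log_levels` is hypothetical, and the regular expression assumes that loggers are invoked through a variable named `LOG`, `LOGGER`, `log`, or `logger`, so calls made through other names are missed.

```python
import re
from collections import Counter

# Assumption: logger variables are named LOG, LOGGER, log or logger.
LOG_CALL = re.compile(
    r"\b(?:LOG|LOGGER|log|logger)\.(trace|debug|info|warn|error|fatal)\s*\("
)


def count_log_levels(patch_text):
    """Count logging calls per level on added ('+') and removed ('-') diff lines."""
    added, removed = Counter(), Counter()
    for line in patch_text.splitlines():
        # Skip the '+++' / '---' file headers of a unified diff.
        # (A removed line whose content itself starts with '--' is also skipped.)
        if line.startswith(("+++", "---")):
            continue
        if line.startswith("+"):
            target = added
        elif line.startswith("-"):
            target = removed
        else:
            continue  # context lines and commit metadata
        for match in LOG_CALL.finditer(line):
            target[match.group(1)] += 1
    return added, removed


sample = '''\
+++ b/Foo.java
+        LOG.debug("Removing {}", entry.getValue());
+    LOG.debug("Copying markers from {}", dirMarkerTracker);
-    LOG.info("old message");
     LOG.warn("unchanged context line");
'''

added, removed = count_log_levels(sample)
print("added:", dict(added))
print("removed:", dict(removed))
```

Counting added and removed calls separately makes it possible to see how logging practice changes over the history rather than only its current state.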

## Some observations
- Basic library: Log4j
- Abstraction libraries: JCL and SLF4J
- Unification libraries: Logback and Log4j2
- The given project (Hadoop) uses Log4j and SLF4J in its code base
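
One way to check which of these libraries a Java file uses is to look at its import statements. The sketch below maps the package prefix of each import to a library name and counts the hits. The function name `logging_libraries` is hypothetical, and matching on imports alone is an assumption: it misses classes referenced by their fully qualified name.

```python
import re
from collections import Counter

# Root packages of the logging libraries listed above.
LIBRARY_PREFIXES = [
    ("org.apache.logging.log4j", "Log4j2"),
    ("org.apache.commons.logging", "JCL"),
    ("org.apache.log4j", "Log4j"),
    ("ch.qos.logback", "Logback"),
    ("org.slf4j", "SLF4J"),
]

# Matches 'import x.y.Z;', 'import static x.y.Z.m;' and 'import x.y.*;'.
IMPORT = re.compile(r"^\s*import\s+(?:static\s+)?([\w.]+)")


def logging_libraries(java_source):
    """Count imports per logging library in the text of one Java file."""
    found = Counter()
    for line in java_source.splitlines():
        match = IMPORT.match(line)
        if not match:
            continue
        name = match.group(1)
        for prefix, library in LIBRARY_PREFIXES:
            if name == prefix or name.startswith(prefix + "."):
                found[library] += 1
                break
    return found


sample = """\
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.apache.log4j.Level;
import java.util.List;
"""

libraries = logging_libraries(sample)
print(dict(libraries))
```

Summing these counters over all `.java` files gives the distribution of logging statement usage across libraries.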

## Step 1: Understanding the importance of logging statements in a codebase
- Logging helps developers trace runtime behaviour
- Logging practices do not seem to be very consistent
- In each code base there are traces of either **insufficient logging** or **excessive logging**
- Logging statements in the code base write timestamped log messages to a prespecified log file
- One challenge software engineers face is how to write the most effective logging statements, in other words, how to maximise the value of the logged information while minimising logging overhead
- Proactive logging --> excessive logging

## Prior studies
- Mining logging code
- Mining logging message
- Automatic log insertion
- Learning to log

## How this analysis can help software engineers
- There are many studies of logging practices because choosing logging statements is something software engineers consistently find difficult
- Unfortunately, all the current research is post-development; these papers focus more on suggesting log statements in a real-time scenario

In [2]:
import pandas as pd


In [6]:
df = pd.read_fwf('text.txt', header = None)

In [7]:
df

Unnamed: 0,0,1
0,commit 5092ea62ecbac840d56978a31bb11cfc14c6fe83,
1,Author: Steve Loughran <stevel@cloudera.com>,
2,Date: Sat Aug 15 12:51:08 2020 +0100,
3,,
4,HADOOP-13230. S3A to optionally retain directo...,
5,,
6,"This adds an option to disable ""empty director...",
7,so avoid throttling and other scale problems.,
8,,
9,This feature is *not* backwards compatible.,
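
`read_fwf` infers fixed-width column boundaries from whitespace, so long lines of a git log can end up split across columns `0` and `1`, and leading indentation is stripped. A line-oriented read keeps every line whole. The following is a minimal sketch in plain Python, assuming the file was produced by `git log -p` in its default format, where only commit headers start with `commit ` in the first column. The function name `split_commits` and the dictionary keys are hypothetical.

```python
def split_commits(lines):
    """Group the lines of a `git log -p` dump into one dict per commit."""
    commits = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("commit "):
            # A new commit header starts a new record.
            current = {"commit": line.split()[1], "author": None,
                       "date": None, "body": []}
            commits.append(current)
        elif current is None:
            continue  # ignore anything before the first commit header
        elif line.startswith("Author:") and current["author"] is None:
            current["author"] = line[len("Author:"):].strip()
        elif line.startswith("Date:") and current["date"] is None:
            current["date"] = line[len("Date:"):].strip()
        else:
            # Commit message and diff lines, with indentation preserved.
            current["body"].append(line)
    return commits


sample = [
    "commit 5092ea62ecbac840d56978a31bb11cfc14c6fe83\n",
    "Author: Steve Loughran <stevel@cloudera.com>\n",
    "Date:   Sat Aug 15 12:51:08 2020 +0100\n",
    "\n",
    "    HADOOP-13230. S3A to optionally retain directory markers.\n",
    "\n",
    '+        LOG.debug("Removing {}", entry.getValue());\n',
]

commits = split_commits(sample)
print(commits[0]["commit"], commits[0]["author"], len(commits[0]["body"]))
```

With the real file, the same function can be fed an open file handle, for example `with open('text.txt', encoding='utf-8', errors='replace') as fh: commits = split_commits(fh)`.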


In [14]:
for _, row in df.iterrows():
    print(list(row))

['commit 5092ea62ecbac840d56978a31bb11cfc14c6fe83', nan]
['Author: Steve Loughran <stevel@cloudera.com>', nan]
['Date:   Sat Aug 15 12:51:08 2020 +0100', nan]
[nan, nan]
['HADOOP-13230. S3A to optionally retain directory markers.', nan]
[nan, nan]
['This adds an option to disable "empty directory" marker deletion,', nan]
['so avoid throttling and other scale problems.', nan]
[nan, nan]
['This feature is *not* backwards compatible.', nan]
['Consult the documentation and use with care.', nan]
[nan, nan]
['Contributed by Steve Loughran.', nan]
[nan, nan]
['Change-Id: I69a61e7584dc36e485d5e39ff25b1e3e559a1958', nan]
[nan, nan]
['diff --git a/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/test/AssertExtensions.java', 'b/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/test/AssertExtensions.java']
['new file mode 100644', nan]
['index 00000000000..8c5e553f71e', nan]
['--- /dev/null', nan]
['+++ b/hadoop-common-project/hadoop-common/src/test/java/org/ap

['-          }', nan]
['-        } catch (AmazonClientException e) {', nan]
['-          throw translateException("getFileStatus", newKey, e);', nan]
['-        }', nan]
['+      } catch (AmazonClientException e) {', nan]
['+        throw translateException("getFileStatus", path, e);', nan]
['}', nan]
['}', nan]
[nan, nan]
['// execute the list', nan]
['if (probes.contains(StatusProbeEnum.List)) {', nan]
['try {', nan]
['+        // this will find a marker dir / as well as an entry.', nan]
['+        // When making a simple "is this a dir check" all is good.', nan]
['+        // but when looking for an empty dir, we need to verify there are no', nan]
['+        // children, so ask for two entries, so as to find', nan]
['+        // a child', nan]
['String dirKey = maybeAddTrailingSlash(key);', nan]
['-        S3ListRequest request = createListObjectsRequest(dirKey, "/", 1);', nan]
['+        // list size is dir marker + at least one non-tombstone entry', nan]
["+        // there's a co

['+      Path path = entry.getKey();', nan]
['+      if (policy.keepDirectoryMarkers(path)) {', nan]
["+        // there's a match", nan]
['+        // remove it from the map.', nan]
['+        entries.remove();', nan]
['+        LOG.debug("Removing {}", entry.getValue());', nan]
['+        removed.add(path);', nan]
['+      }', nan]
['+    }', nan]
['+    return removed;', nan]
['+  }', nan]
['+', nan]
['+  /**', nan]
['+   * This is a marker entry stored in the map and', nan]
['+   * returned as markers are deleted.', nan]
['+   */', nan]
['+  public static final class Marker {', nan]
['+    /** Path of the marker. */', nan]
['+    private final Path path;', nan]
['+', nan]
['+    /**', nan]
['+     * Key in the store.', nan]
['+     */', nan]
['+    private final String key;', nan]
['+', nan]
['+    /**', nan]
['+     * The file status of the marker.', nan]
['+     */', nan]
['+    private final S3ALocatedFileStatus status;', nan]
['+', nan]
['+    private Marker(final Path path,', 

['+   * @param dirMarkerTracker tracker of markers', nan]
['+   * @return how long it took.', nan]
['+   */', nan]
['+  private OperationDuration copyEmptyDirectoryMarkers(', nan]
['+      final String srcKey,', nan]
['+      final String dstKey,', nan]
['+      final DirMarkerTracker dirMarkerTracker) throws IOException {', nan]
['+    // directory marker work.', nan]
['+    LOG.debug("Copying markers from {}", dirMarkerTracker);', nan]
['+    final StoreContext storeContext = getStoreContext();', nan]
['+    Map<Path, DirMarkerTracker.Marker> leafMarkers =', nan]
['+        dirMarkerTracker.getLeafMarkers();', nan]
['+    Map<Path, DirMarkerTracker.Marker> surplus =', nan]
['+        dirMarkerTracker.getSurplusMarkers();', nan]
['+    // for all leaf markers: copy the original', nan]
['+    DurationInfo duration = new DurationInfo(LOG, false,', nan]
['+        "copying %d leaf markers with %d surplus not copied",', nan]
['+        leafMarkers.size(), surplus.size());', nan]
['+    fo

['+    // How many markers are expected?', nan]
['+    int expected = 0;', nan]
['+    String value = command.getOptValue(OPT_EXPECTED);', nan]
['+    if (value != null && !value.isEmpty()) {', nan]
['+      expected = Integer.parseInt(value);', nan]
['+    }', nan]
['+', nan]
['+    // determine the action', nan]
['+    boolean audit = command.getOpt(OPT_AUDIT);', nan]
['+    boolean clean = command.getOpt(OPT_CLEAN);', nan]
['+    if (audit == clean) {', nan]
['+      // either both are set or neither are set', nan]
['+      // this is equivalent to (not audit xor clean)', nan]
['+      errorln(getUsage());', nan]
['+      throw new ExitUtil.ExitException(EXIT_USAGE,', nan]
['+          "Exactly one of " + AUDIT + " and " + CLEAN);', nan]
['+    }', nan]
['+    int limit = UNLIMITED_LISTING;', nan]
['+    value = command.getOptValue(OPT_LIMIT);', nan]
['+    if (value != null && !value.isEmpty()) {', nan]
['+      limit = Integer.parseInt(value);', nan]
['+    }', nan]
['+    final S

['+2. You know all applications which read data from the bucket are compatible.', nan]
['+', nan]
['+### <a name="backups"></a> Applications backing up data.', nan]
['+', nan]
['+It is not enough to have a version of Apache Hadoop which is compatible, any', nan]
['+application which backs up an S3 bucket or copies elsewhere must have an S3', nan]
['+connector which is compatible. For the Hadoop codebase, that means that if', nan]
['+distcp is used, it _must_ be from a compatible hadoop version.', nan]
['+', nan]
['+### <a name="fallure-mode"></a> How will incompatible applications/versions fail?', nan]
['+', nan]
['+Applications using an incompatible version of the S3A connector will mistake', nan]
['+directories containing data for empty directories. This means that:', nan]
['+', nan]
['+* Listing directories/directory trees may exclude files which exist.', nan]
['+* Queries across the data will miss data files.', nan]
['+* Renaming a directory to a new location may exclude files unde

['+This adds extra overhead to every operation, but helps verify that the connector is', nan]
['+not keeping markers where it needs to be deleting them -and hence backwards compatibility', nan]
['+is maintained.', nan]
['+', nan]
['## <a name="scale"></a> Scale Tests', nan]
[nan, nan]
['There are a set of tests designed to measure the scalability and performance', nan]
['diff --git a/hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/AbstractS3ATestBase.java b/hado', 'p-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/AbstractS3ATestBase.java']
['index d94288dfc30..a2ee9ea5f7b 100644', nan]
['--- a/hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/AbstractS3ATestBase.java', nan]
['+++ b/hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/AbstractS3ATestBase.java', nan]
['@@ -25,16 +25,20 @@', nan]
['import org.apache.hadoop.fs.contract.AbstractFSContractTestBase;', nan]
['import org.apache.hadoop.fs.contract.ContractTestUtils;', nan]
['import 

['import java.util.EnumSet;', nan]
['-import java.util.UUID;', nan]
['-import java.util.concurrent.Callable;', nan]
['+', nan]
[nan, nan]
['import static org.apache.hadoop.fs.contract.ContractTestUtils.*;', nan]
['-import static org.apache.hadoop.fs.s3a.Constants.S3_METADATA_STORE_IMPL;', nan]
['import static org.apache.hadoop.fs.s3a.Statistic.*;', nan]
['import static org.apache.hadoop.fs.s3a.S3ATestUtils.*;', nan]
['+import static org.apache.hadoop.fs.s3a.performance.OperationCost.*;', nan]
['import static org.apache.hadoop.test.GenericTestUtils.getTestDir;', nan]
['import static org.apache.hadoop.test.LambdaTestUtils.intercept;', nan]
[nan, nan]
['/**', nan]
['- * Use metrics to assert about the cost of file status queries.', nan]
['- * {@link S3AFileSystem#getFileStatus(Path)}.', nan]
['- * Parameterized on guarded vs raw.', nan]
['+ * Use metrics to assert about the cost of file API calls.', nan]
['+ * Parameterized on guarded vs raw. and directory marker keep vs delete', nan]
['*

['}', nan]
[nan, nan]
['@Test', nan]
['-  public void testCreateCost() throws Throwable {', nan]
['-    describe("Test file creation cost -raw only");', nan]
['+  public void testNeedEmptyDirectoryProbeRequiresList() throws Throwable {', nan]
['S3AFileSystem fs = getFileSystem();', nan]
['-    assume("Unguarded FS only", !fs.hasMetadataStore());', nan]
['-    resetMetricDiffs();', nan]
['-    Path testFile = path("testCreateCost");', nan]
[nan, nan]
['+    intercept(IllegalArgumentException.class, "", () ->', nan]
['+            fs.s3GetFileStatus(new Path("/something"), "/something",', nan]
['+                StatusProbeEnum.HEAD_ONLY, null, true));', nan]
['+  }', nan]
['+  @Test', nan]
['+  public void testCreateCost() throws Throwable {', nan]
['+    describe("Test file creation cost -raw only");', nan]
['+    assumeUnguarded();', nan]
['+    Path testFile = methodPath();', nan]
['// when overwrite is false, the path is checked for existence.', nan]
['-    try (FSDataOutputStream o

['+        !s3guard, () -> readonlyFS.open(emptyFile));', nan]
['if (optIn.isPresent()) {', nan]
['try (FSDataInputStream is = optIn.get()) {', nan]
['Assertions.assertThat(is.read())', nan]
['@@ -461,17 +455,17 @@ public void checkGlobOperations() throws Throwable {', nan]
['describe("Glob Status operations");', nan]
['// baseline: the real filesystem on a subdir', nan]
['globFS(getFileSystem(), subdirFile, null, false, 1);', nan]
['-    // a file fails if not in auth mode', nan]
['-    globFS(readonlyFS, subdirFile, null, !guardedInAuthMode, 1);', nan]
['+    // a file fails if not guarded', nan]
['+    globFS(readonlyFS, subdirFile, null, !s3guard, 1);', nan]
["// empty directories don't fail.", nan]
['-    FileStatus[] st = globFS(readonlyFS, emptyDir, null, !s3guard, 1);', nan]
['+    FileStatus[] st = globFS(readonlyFS, emptyDir, null, false, 1);', nan]
['if (s3guard) {', nan]
['assertStatusPathEquals(emptyDir, st);', nan]
['}', nan]
[nan, nan]
['st = globFS(readonlyFS,', nan]
['

['+  protected <T> T verifyRaw(', nan]
['+      OperationCost cost,', nan]
['+      Callable<T> eval) throws Exception {', nan]
['+    return verifyMetrics(eval, whenRaw(cost));', nan]
['+  }', nan]
['+', nan]
['+  /**', nan]
['+   * Execute {@code S3AFileSystem#innerGetFileStatus(Path, boolean, Set)}', nan]
['+   * for the given probes.', nan]
['+   * expect the specific HEAD/LIST count with a raw FS.', nan]
['+   * @param path path', nan]
['+   * @param needEmptyDirectoryFlag look for empty directory', nan]
['+   * @param probes file status probes to perform', nan]
['+   * @param cost expected cost', nan]
['+   * @return the status', nan]
['+   */', nan]
['+  public S3AFileStatus verifyRawInnerGetFileStatus(', nan]
['+      Path path,', nan]
['+      boolean needEmptyDirectoryFlag,', nan]
['+      Set<StatusProbeEnum> probes,', nan]
['+      OperationCost cost) throws Exception {', nan]
['+    return verifyRaw(cost, () ->', nan]
['+        innerGetFileStatus(getFileSystem(),', nan]
[

['+      s3client.deleteObject(bucket, key);', nan]
['+      return "deleted " + key;', nan]
['+    });', nan]
['+  }', nan]
['+', nan]
['+  /**', nan]
['+   * Issue a HEAD request.', nan]
['+   * @param key', nan]
['+   * @return a description of the object.', nan]
['+   */', nan]
['+  private String head(final String key) throws Exception {', nan]
['+    ObjectMetadata md = exec("HEAD " + key, () ->', nan]
['+        s3client.getObjectMetadata(bucket, key));', nan]
['+    return String.format("Object %s of length %d",', nan]
['+        key, md.getInstanceLength());', nan]
['+  }', nan]
['+', nan]
['+  /**', nan]
['+   * Issue a HEAD request and expect a 404 back.', nan]
['+   * @param key', nan]
['+   * @return the metadata', nan]
['+   */', nan]
['+  private void head404(final String key) throws Exception {', nan]
['+    intercept(FileNotFoundException.class, "",', nan]
['+        "Expected 404 of " + key, () ->', nan]
['+        head(key));', nan]
['+  }', nan]
['+', nan]
['+  /**'

['+ * distributed with this work for additional information', nan]
['+ * regarding copyright ownership.  The ASF licenses this file', nan]
['+ * to you under the Apache License, Version 2.0 (the', nan]
['+ * "License"); you may not use this file except in compliance', nan]
['+ * with the License.  You may obtain a copy of the License at', nan]
['+ *', nan]
['+ *     http://www.apache.org/licenses/LICENSE-2.0', nan]
['+ *', nan]
['+ * Unless required by applicable law or agreed to in writing, software', nan]
['+ * distributed under the License is distributed on an "AS IS" BASIS,', nan]
['+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.', nan]
['+ * See the License for the specific language governing permissions and', nan]
['+ * limitations under the License.', nan]
['+ */', nan]
['+', nan]
['+package org.apache.hadoop.fs.s3a.performance;', nan]
['+', nan]
['+import java.util.ArrayList;', nan]
['+import java.util.Arrays;', nan]
['+import java.util.List;', nan]

['}', nan]
[nan, nan]
['-  /*', nan]
['+  /**', nan]
['* Create DirListingMetadata with two dirs and one file living in directory', nan]
["-   * 'parent'", nan]
["+   * 'parent'.", nan]
['*/', nan]
['private static DirListingMetadata makeTwoDirsOneFile(Path parent) {', nan]
['PathMetadata pathMeta1 = new PathMetadata(', nan]
['diff --git a/hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/tools/AbstractMarkerToolTest.ja', 'a b/hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/tools/AbstractMarkerToolTest.']
['new file mode 100644', nan]
['index 00000000000..00e62d94910', nan]
['--- /dev/null', nan]
['+++ b/hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/tools/AbstractMarkerToolTest.java', nan]
['@@ -0,0 +1,334 @@', nan]
['+/*', nan]
['+ * Licensed to the Apache Software Foundation (ASF) under one', nan]
['+ * or more contributor license agreements.  See the NOTICE file', nan]
['+ * distributed with this work for additional information', nan]
['+

['+      filesUnderBase.add(name);', nan]
['+      return file;', nan]
['+    }', nan]
['+  }', nan]
['+', nan]
['+  /**', nan]
['+   * Create the "standard" test paths.', nan]
['+   * @param fs filesystem', nan]
['+   * @param base base dir', nan]
['+   * @return the details on what was created.', nan]
['+   */', nan]
['+  private CreatedPaths createPaths(FileSystem fs, Path base)', nan]
['+      throws IOException {', nan]
['+    CreatedPaths r = new CreatedPaths(fs, base);', nan]
['+    // the directories under which we will create files,', nan]
['+    // so expect to have markers', nan]
['+    r.mkdir("");', nan]
['+', nan]
['+    // create the empty dirs', nan]
['+    r.emptydir("empty");', nan]
['+', nan]
['+    // dir 1 has a file underneath', nan]
['+    r.mkdir("dir1");', nan]
['+    expectedFileCount = r.files("dir1/file1");', nan]
['+', nan]
['+    expectedMarkersUnderDir1 = 1;', nan]
['+', nan]
['+', nan]
['+    // dir2 has a subdir', nan]
['+    r.dirs("dir2", "dir2/dir3")